Most analysis methods cannot be performed if there are missing values in the data. If you are going on to study your data by clustering, you may also need to put different genes on a single scale of variation. Moreover, missing values may prevent proper classification, and poor substitution schemes for missing values may cause classification errors. If every substituted value is simply the most likely value, then the individual values are less likely to help define class (cluster) boundaries. Missing data values can negatively impact discovery results, and errors or data skews can propagate across subsequent runs and cause a larger, cumulative error effect.
CodeLinker provides you with tools to remove missing values, filter, and normalize your data. Filtering provides a number of gene prioritization options. These processes generally take a large number of genes and apply selection criteria so that the output includes fewer genes. Some methods remove all of the genes that do not meet specified criteria, while others allow you to specify the number of genes that will be left after the filtering. In CodeLinker, the term normalization is used to describe scaling, translation, or any other numerical transformation of the data besides filtering. Available normalizations include logarithm, standardization, division by maximum, and scaling between 0 and 1.
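To make the four normalizations named above concrete, here is a minimal Python sketch using NumPy. It is an illustration only, not CodeLinker's implementation: the function names, the base-2 default for the logarithm, and the example values are all assumptions made for this sketch.

```python
import numpy as np

def log_transform(values, base=2.0):
    """Logarithm of strictly positive values (base 2 is an assumed default)."""
    x = np.asarray(values, dtype=float)
    if np.any(x <= 0):
        raise ValueError("log transform needs strictly positive values")
    return np.log(x) / np.log(base)

def standardize(values):
    """Shift to mean 0 and scale to unit standard deviation."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()

def divide_by_max(values):
    """Divide every value by the largest value in the profile."""
    x = np.asarray(values, dtype=float)
    return x / x.max()

def scale_0_1(values):
    """Map the smallest value to 0 and the largest to 1."""
    x = np.asarray(values, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Made-up expression profile for a single gene across four samples.
profile = np.array([1.0, 2.0, 4.0, 8.0])

for transform in (log_transform, standardize, divide_by_max, scale_0_1):
    print(transform.__name__, transform(profile))
```

Each function works on one profile at a time. Applying the same transformation to every gene is what puts genes with very different raw ranges on a comparable scale before clustering.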
CodeLinker provides you with a rich set of tools for analyzing, exploring and visualizing your microarray or RNA-Seq data. Choose from clustering algorithms such as K-Means Clustering, Jarvis-Patrick Clustering, Agglomerative Hierarchical Clustering and SOMs (Self Organizing Maps), as well as PCA (Principal Component Analysis).
To visualize your analyses, create specialized plots tuned for the algorithms you used to perform the original analysis. Each plot type can be customized, and you can export your plot in PNG, SVG or PDF formats.
Clustering is the name given to the task of grouping items together in such a way that those within a cluster are more similar to each other than to those in other clusters. With CodeLinker you can choose from K-Means, Jarvis-Patrick, or Agglomerative Hierarchical Clustering. Once you have applied your chosen clustering algorithm, you will want to use one of CodeLinker’s Clustering Plots.
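As an illustration of the grouping idea, the following Python sketch implements the textbook K-Means procedure (Lloyd's algorithm) with NumPy. It is not CodeLinker's code: the function name, the choice to pass starting centroids explicitly, and the toy profiles are assumptions made for this example.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def k_means(points, initial_centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return assign(points, centroids), centroids

# Made-up expression profiles: four genes measured in three samples.
profiles = np.array([
    [1.0, 1.0, 1.0],
    [1.2, 0.9, 1.1],
    [5.0, 5.0, 5.0],
    [5.1, 4.8, 5.2],
])

# Start from the first and third profiles (an arbitrary, assumed choice).
labels, centroids = k_means(profiles, profiles[[0, 2]])
print(labels)
print(centroids)
```

The centroids returned here are the kind of exemplar profile a centroid-style plot summarizes, while the labels identify the individual members of each cluster.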
You can begin your visualizations using the Color Matrix Plot. This enables you to easily visualize the values in a dataset. Other plots include the Scatter Plot, which is used for the pair-wise comparison of two samples or two genes. The plot helps you identify visually those genes that show significant induction or repression. However, if you want to focus on just a few genes or a few samples, then the Coordinate Plot is a great way to look at a gene's expression profile over all your test samples or a sample’s expression over the genes of interest. With the Centroid Plot you can visualize the exemplar profile for each of the clusters arising from the algorithm you employed. The Cluster Plot displays the profiles of individual members within a cluster. Using this plot type it is possible to drill down into the clusters and view the individual member profiles.
Tree plots visually highlight clustering relationships, and CodeLinker gives you two types of tree plot. The Matrix Tree Plot is a combination of a tree plot and color matrix. The tree produced is a close reflection of the algorithm that generated it, such that closely related genes tend to appear beside each other in the diagram. The Two Way Matrix Tree Plot is useful for visualizing the results of two clustering experiments simultaneously. One must be based on genes and the other on samples, and both must be derived from the same original dataset.
SOM (SELF ORGANIZING MAP) PLOTS
The SOM or Self Organizing Map is a clustering algorithm that is used to map a multi-dimensional data set onto a two-dimensional surface such as a plot. The goal is to uncover some of the underlying structure of the data by grouping similar data items together.
There are three SOM plots to choose from. The SOM Centroid Plot is used to see the profiles of the values in a dataset associated with a particular node. The SOM Cluster Plot lets you drill down into the SOM cluster to view individual member profiles. The SOM Matrix Tree Plot uses a color matrix of values (typically gene expression levels) in conjunction with a tree plot.
PCA (PRINCIPAL COMPONENT ANALYSIS) PLOTS
PCA is a powerful technique used mostly in the exploratory analysis of data and in predictive modeling. CodeLinker lets you visualize your analyses with six different plots.
The Scree Plot is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component. Such a plot can often show a clear separation in fraction of total variance where the most important components cease and the least important components begin. The plot is called a Scree Plot because it often looks like a scree slope, where rocks have fallen down and accumulated on the side of a mountain. The Score Plot involves projecting the data onto the Principal Components in two dimensions. Since the Principal Components were computed in a fashion which best carries the variation in the original data it is often easier to see structure in your data with this plot than it would be in the original data. The 3D Score Plot is a scatter plot where the x, y and z axes represent individual Principal Components. The points in the plot represent the original data projected onto the individual Principal Components.
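The quantities behind the Scree Plot and the Score Plot can be sketched in a few lines of Python with NumPy. This is a generic PCA computed through the singular value decomposition, offered as an illustration under assumptions rather than as CodeLinker's method: the function name and the small samples-by-genes matrix are made up for this example.

```python
import numpy as np

def pca(data):
    """PCA of a samples-by-variables matrix via SVD of the centered data."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    # Variance carried by each Principal Component.
    variances = singular_values**2 / (data.shape[0] - 1)
    # Fraction of total variance per component: the values a scree plot shows.
    explained_fraction = variances / variances.sum()
    # Original data projected onto the components: the values a score plot shows.
    scores = centered @ components.T
    return explained_fraction, scores, components

# Made-up data: five samples (rows) by three genes (columns).
data = np.array([
    [2.0,  4.0, 1.0],
    [3.0,  6.0, 1.1],
    [4.0,  8.0, 0.9],
    [5.0, 10.0, 1.2],
    [6.0, 12.0, 1.0],
])

explained_fraction, scores, components = pca(data)
print(explained_fraction)  # one value per component, largest first
print(scores[:, :2])       # coordinates for a 2D score plot
```

In this toy matrix the first two genes rise together, so nearly all of the variance falls on the first component, which is the sharp drop a scree plot makes visible. Each row of `components` holds the individual elements of one Principal Component.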
Three closely related plots are the Loadings Color Matrix Plot, the Loadings Line Plot and the Loadings Scatter Plot. These display the individual elements of the Principal Components. The term Loadings refers to the extent to which the original variables (genes or samples) influence the Principal Component. The Loadings Color Matrix Plot displays these loadings as a tiled grid of colored cells. The Loadings Line Plot displays the loadings as a connected line graph. The Loadings Scatter Plot displays the loadings in a scatter plot of one selected Principal Component versus another selected Principal Component.
To visualize your analyses, there are specialized plots tuned for the algorithms you used to perform the original analysis. Each plot type can be customised and you can export your plot in PNG, SVG, or PDF formats.
SLAM, IBIS and ANN – Prediction Tools
RNA-Seq has brought about a revolution in the study of gene expression. It is now possible to study the expression landscape of virtually any organism, even if a species-specific microarray is not readily available. However, each technological innovation brings about new problems, and for RNA-Seq the problem is the sheer quantity of data that is produced. Eyeballing it is no longer an option, and Excel will not cope. While there are powerful tools out there, they are often command-line driven, and tight deadlines mean you don't have the time to learn how to use yet another package on the command line. With Sequencher, we gave you the ability to run your NGS alignments and gene expression differential analysis through an intuitive GUI. Now with CodeLinker, we are giving you the same ability: an easy-to-use GUI with which to analyse your RNA-Seq data in depth, from your desktop.
Imagine that you are studying a disease but the expression data you have so far, while indicative, are not diagnostic. Starting with RNA-Seq differential expression results, you can use CodeLinker to find new patterns of gene expression using SLAM and ANN that are predictive of the disease you are studying. The data you have contains hidden associations (sets of genes and expression values), and with the help of SLAM/ANN, or IBIS, you can uncover these associations.
Sub-Linear Association Mining (SLAM) searches your gene expression data looking for sets of features, that is to say patterns of expression, which occur together more than might be expected by chance and discriminate between the values of any variable (gene). Once you have exposed these associations, you can then use them to train the ANN or Artificial Neural Network and then go on to classify test data. While they say that nothing good comes from a committee, that is definitely not the case when you use committees of networks. This method will generate more accurate results than would be obtained from just a single neural network. Having taught the ANN with the training data you provided, you are now ready to analyse your test data. The results can be displayed in the graphically rich Classification Plot, giving each sample in your test set a classification – predicted, true class (something different from your original classification), or unknown. This will assist in confirming or refuting your hypothesis concerning the data.
Suppose you are studying the response of certain cell types to a drug treatment. Just as you told Cuffdiff the conditions for each sample (drug dose, tissue type, phenotype), you can use the same classifications with CodeLinker. Two sets of information are imported into CodeLinker – a set of expression data, and the list of tissues and their responses to the drug treatment. These can be explored using the Integrated Bayesian Inference System (IBIS).
The IBIS classifier is a method that uses Bayesian probabilities to look at the patterns in your data. IBIS offers powerful search capabilities into your data. It can identify non-linear and combinatorial patterns of gene expression that characterize different toxicity responses, disease states, or treatment outcomes. Furthermore, it can be used to build classifiers that can identify these patterns in new samples. It can also be used as a search tool to identify single genes and small gene sets that show interesting expression patterns relative to the sample classification.
Once these two sets of information are imported into CodeLinker, a Linear Discriminant Analysis search is performed, which evaluates the accuracy of each gene when used as a linear discriminator, that is, its ability to separate the data into two or more classes. The Mean Square Error (MSE) for each gene reflects how well the data match the linear model, with lower values indicating a better fit. Choosing one of these genes, you can then display the results of the analysis on a plot whose background color gradient represents the classification that IBIS discovered; the plot displays the gene expression as spots whose colors represent the initial classification that you gave. You can see with ease whether the genes you have focussed on fit the pattern you expected or are exposing new and unexpected correlations. With the 2D plot, you can explore the putative relationships between pairs of genes and find new relationships that divide along the line of your initial classification. Spotting the false positives and false negatives is as simple as looking at the colors of the spots in relation to the colored background on the graph. For more complex data where there is no linear relationship, CodeLinker provides you with the ability to perform 2D Linear Discriminant Analysis or even Quadratic or Gaussian Discriminant Analysis.
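The idea of scoring each gene as a single linear discriminator and ranking by Mean Square Error can be sketched as follows. This is a stand-in, not the IBIS algorithm: it assumes two classes coded as 0 and 1, fits an ordinary least-squares line from one gene's expression to the class code, and reports the mean squared residual. The function names, gene names and values are all hypothetical.

```python
import numpy as np

def single_gene_mse(expression, class_codes):
    """MSE of a least-squares line predicting the 0/1 class from one gene."""
    x = np.asarray(expression, dtype=float)
    y = np.asarray(class_codes, dtype=float)
    design = np.column_stack([x, np.ones_like(x)])  # slope and intercept terms
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefficients
    return float(np.mean(residuals**2))

def rank_genes(matrix, class_codes, names):
    """Sort genes from lowest MSE (best linear separation) to highest."""
    scored = [(name, single_gene_mse(row, class_codes))
              for name, row in zip(names, matrix)]
    return sorted(scored, key=lambda item: item[1])

# Made-up data: two genes (rows) measured in six samples (columns).
expression = np.array([
    [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # gene_a tracks the classes closely
    [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],  # gene_b has the same mean in both classes
])
class_codes = [0, 0, 0, 1, 1, 1]      # e.g. non-responder = 0, responder = 1

ranked = rank_genes(expression, class_codes, ["gene_a", "gene_b"])
for name, mse in ranked:
    print(name, round(mse, 4))
```

A gene whose expression cleanly separates the two groups ends up near the top of the ranking, while a gene with no linear relationship to the classification scores no better than predicting the average class for every sample.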
Meet CodeLinker, the tool that will revolutionize your RNA-Seq analysis. Analyze your Cuffdiff results with ease using CodeLinker's powerful clustering tools and visualizations. CodeLinker will help you find the associations and relationships you are looking for in your data.
CodeLinker is a user-friendly desktop software program for analyzing your microarray data with powerful analyses and rich visualizations. It's a unique paradigm where every analysis is an experiment that makes your work really flow. And for those experiments where you are looking for concordance between your RNA-Seq and Microarray data, shouldn't you have concordance in your analysis software as well? You can with CodeLinker.