Frequently Asked Questions (FAQs)

  1. When should I use ResistoXplorer?
  2. What types of input does ResistoXplorer accept?
  3. Does ResistoXplorer accept raw sequencing data as an input?
  4. How can I format my data into an acceptable input for ResistoXplorer?
  5. Why should I use the data filtering option?
  6. Which category should I choose to perform data filtering?
  7. Why should I normalize my data?
  8. What are the various normalization methods available in ResistoXplorer?
  9. Which normalization method should I choose?
  10. When should I opt to rarefy my data?
  11. What is the Unassigned category?
  12. How can I detect and deal with outliers?
  1. When should I use ResistoXplorer?

    ResistoXplorer is a user-friendly, comprehensive web-based tool for analyzing data sets generated from AMR metagenomics studies. The focus of this tool is to perform visual, statistical and exploratory analysis of resistome data. It is designed to support five kinds of analysis:

    • Compositional profiling: provides various analyses commonly used in community ecology, such as alpha diversity, rarefaction curves or ordination analysis, coupled with intuitive visualizations (interactive Sankey diagram, zoomable sunburst, treemap, stacked bar or area plots) to help users gain a comprehensive view of the composition of their uploaded resistome profiles.
    • Functional profiling: supports analysis and profiling of the resistome at various functional categories or levels (e.g. Drug class, Mechanism) to help users gain more biologically actionable insights, together with a better understanding of their data.
    • Comparative analysis: provides multiple standard as well as more advanced (CoDA-based) normalization techniques, coupled with differential analysis approaches (edgeR, DESeq2, LEfSe, metagenomeSeq, ANCOM and ALDEx2), to detect features (ARGs) that differ significantly between the conditions under investigation.
    • Integrative analysis: enables users to integrate their paired resistome and microbiome abundance profiles to understand the complex interplay and to explore potential associations between microbial ecology and AMR, using several univariate and multivariate omics integration statistical methods.
    • ARGs-microbial host network exploration: allows users to intuitively explore the microbial hosts and functional associations of their uploaded list of antimicrobial resistance genes (ARGs) using a powerful, fully featured network visual analytics system, providing better interpretation of AMR mechanisms and information on possible dissemination routes of ARGs.

  2. What types of input does ResistoXplorer accept?

    ResistoXplorer supports the most common formats generated in metagenomic-based AMR studies, i.e., count abundance tables (ARG/OTU/taxa), together with metadata and annotation files. Users can find detailed descriptions, example datasets and screenshots on the Data Format page.

  3. Does ResistoXplorer accept raw sequencing data as an input?

    ResistoXplorer is explicitly designed for downstream analysis and currently does not accept raw sequencing data. The annotated ARG count or abundance table obtained after preprocessing and upstream analysis of metagenomic sequencing data is the standard input for ResistoXplorer.
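The expected input shape can be sketched as follows. This is an illustrative example only (the ARG and sample names are made up, not a prescribed format); the Data Format page has the authoritative layout, but the general idea is features (ARGs) as rows and samples as columns:

```python
import pandas as pd

# Hypothetical annotated ARG count table: features (ARGs) as rows,
# samples as columns. All names here are illustrative.
counts = pd.DataFrame(
    {"Sample1": [120, 0, 35], "Sample2": [98, 4, 0], "Sample3": [210, 1, 12]},
    index=["tetA", "blaCTX-M", "ermB"],
)
counts.index.name = "ARG"
counts.to_csv("arg_counts.csv")  # comma-separated, one header row
print(counts.shape)  # (3, 3): three features across three samples
```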

  4. How can I format my data into an acceptable input for ResistoXplorer?

    For shotgun metagenomics data, there are several bioinformatics pipelines for processing the raw sequence data into an ARG count table. For instance,

    • AMRPlusPlus: a Galaxy-based metagenomics pipeline that uses current and new tools to help identify and quantify ARGs within metagenomic sequence data and generate a count table in text format.
    • ARGs-OAP: utilizes the SARG database and a hybrid functional gene annotation pipeline for fast annotation and classification of ARG-like sequences from metagenomic data.
    • DeepARG: a fully automated data analysis pipeline for antibiotic resistance annotation of metagenomic data using the deepARG algorithm and the deepARG-DB database.

    For more information on available methods and databases for resistome annotation, please refer to the review articles by Boolchandani, M et al. and Gupta, CL et al.

  5. Why should I use the data filtering option?

    The purpose of data filtering is to allow users to remove low quality and/or uninformative features to improve downstream statistical analysis. ResistoXplorer contains three data filtering procedures:

    1. Minimal data filtering - this procedure removes features containing all zeros, or appearing in only one sample with an extremely low count (considered artifacts). Such extremely rare features should be removed from analysis for both biological and technical reasons. It applies to all analyses except alpha diversity and rarefaction analysis;
    2. Low abundance features - features that are present in only a few samples with very low read counts may be due to sequencing errors or low-level contamination;
    3. Low variance features - features with near-constant abundance across all samples are unlikely to be associated with the conditions under study;
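The three filters above can be sketched on a toy count table. This is a minimal illustration with made-up cutoffs, not ResistoXplorer's exact defaults:

```python
import pandas as pd

# Toy ARG count table (features as rows, samples as columns); names illustrative.
counts = pd.DataFrame(
    {"S1": [0, 1, 50, 30], "S2": [0, 0, 60, 31], "S3": [0, 0, 55, 29]},
    index=["arg_zero", "arg_rare", "arg_ok", "arg_flat"],
)

# 1. Minimal filtering: drop features that are zero everywhere, or that
#    appear in only one sample with an extremely low count.
nonzero = (counts > 0).sum(axis=1)
minimal = counts[(nonzero > 1) | ((nonzero == 1) & (counts.max(axis=1) > 2))]

# 2. Low-abundance filter: require a minimum count in a minimum number
#    of samples (here >= 4 reads in >= 2 samples; cutoffs are illustrative).
abundant = minimal[(minimal >= 4).sum(axis=1) >= 2]

# 3. Low-variance filter: drop the fraction of features with the smallest
#    interquartile range across samples (here, the bottom 25%).
iqr = abundant.quantile(0.75, axis=1) - abundant.quantile(0.25, axis=1)
filtered = abundant[iqr > iqr.quantile(0.25)]
print(list(filtered.index))  # only the informative feature survives
```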

  6. Which category should I choose to perform data filtering?

    Users can choose from the low count (abundance) filter options to remove features with very small counts in very few samples (i.e., low prevalence). If the primary purpose is comparative analysis, users should also filter features that exhibit low variance, based on the interquartile range, standard deviation or coefficient of variation, as such features are very unlikely to be significant in the comparative analysis.

    The Data Inspection page provides a text summary and statistics of the uploaded files (Text Summary tab), along with a graphical overview of the read counts for all uploaded samples (Library Size Overview tab), both of which are informative for choosing an appropriate filtering method and cutoff for the data.

  7. Why should I normalize my data?

    Metagenomic data possess some unique characteristics, such as vast differences in sequencing depth, sparsity, skewed distributions, over-dispersion and compositionality. Hence, it is critical to normalize the data to achieve comparable and meaningful results. To deal with these issues, ResistoXplorer supports three kinds of normalization approaches:

    • Rarefaction: this method deals with uneven sequencing depths by randomly removing reads in the different samples until the sequencing depth is equal in all samples.
    • Scaling-based methods: these methods account for uneven sequencing depths by deriving a sample-specific scaling factor for bringing samples to the same scale for comparison.
    • Transformation-based methods: these include approaches that deal with sparsity, compositionality, and large variations within the count data.

  8. What are the various normalization methods available in ResistoXplorer?

    ResistoXplorer supports a variety of methods for data normalization. A brief summary is provided below:

    • Rarefaction: this method deals with uneven sequencing depths by randomly removing reads in the different samples until the library size of every sample equals that of the sample with the lowest sequencing depth. Whenever the library sizes of the samples vary too much (i.e., >10X), it is recommended to perform rarefaction before normalizing your data. It also seems to perform better for ordination or clustering analysis. For more details, please refer to the paper by Weiss, S et al.
    • Total Sum Scaling (TSS) normalization: this method removes technical bias related to different sequencing depths across libraries by simply dividing each feature count by the total library size to yield the relative proportion of counts for that feature. For easier interpretation, we can multiply it by 1,000,000 to get the number of reads corresponding to that feature per million reads. It is also known as Counts per Million (CPM) normalization. LEfSe uses this kind of approach to normalize data before conducting statistical testing.
    • Log Counts per Million (CPM) normalization: this method performs a log transformation on counts-per-million normalized data in order to deal with large variance in count distributions in addition to library size differences. This kind of approach is used by R packages such as edgeR and voom, which are designed for identifying differentially abundant genes in RNA-Seq count data.
    • Cumulative Sum Scaling (CSS) normalization: this method corrects for differences in library size by calculating the scaling factors as the cumulative sum of gene abundances up to a data-derived threshold to remove the variability in data caused by highly abundant genes. By default, metagenomeSeq utilizes this approach for differential analysis.
    • Relative proportion: this approach computes the relative proportion of a feature by dividing each feature count by the total number of counts (library size) per sample.
    • Relative log expression (RLE) normalization: this method estimates the median library from the geometric mean of the gene-specific abundances over all samples. The median ratio of each sample to the median library is used as the scaling factor. By default, DESeq2 utilizes this approach for identifying differentially abundant genes. This method was proposed by Anders & Huber.
    • Trimmed mean of M-values (TMM) normalization: this method, proposed by Robinson & Oshlack, derives the scaling factor using a weighted trimmed mean of the log-transformed gene-count fold-changes (relative abundances) between samples. By default, edgeR utilizes this approach for differential analysis in ResistoXplorer.
    • Upper Quartile normalization: this approach calculates the scaling factors from the 75th percentile of the gene count distribution for each library, after removing genes which are zero in all libraries. This method, proposed by Bullard et al., is implemented in the edgeR package.
    • Hellinger Transformation: this method computes the relative proportion of a feature by dividing each feature count by the total number of counts (library size) per sample, and then taking the square root of it.
    • Log-Ratio (CLR and ALR) Transformation: these methods are specifically designed to normalize compositional data. They transform the relative abundances (or the values in the table of counts) of each element into ratios between all parts, using either the geometric mean of the sample or a single element as the reference. Taking the logarithm of these ratios then brings the data into Euclidean (real) space, so that standard statistical methods can be applied. CoDA-based methods such as ALDEx2 and ANCOM use this approach to normalize data before conducting statistical testing.
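A few of the formulas above (TSS/CPM, log-CPM, Hellinger and CLR) can be sketched on a toy count matrix. This is an illustrative implementation of the general definitions, not ResistoXplorer's exact code, and the pseudocount of 1 is an assumption:

```python
import numpy as np

# Toy count matrix: features as rows, samples as columns (values illustrative).
counts = np.array([[100., 400.],
                   [300., 100.],
                   [600., 500.]])

lib_size = counts.sum(axis=0)      # per-sample library size
tss = counts / lib_size            # Total Sum Scaling: relative proportions
cpm = tss * 1e6                    # Counts per Million
log_cpm = np.log2(cpm + 1)         # log-CPM, with a pseudocount of 1
hellinger = np.sqrt(tss)           # Hellinger: square root of proportions

# CLR: log of each (pseudocounted) value over the geometric mean of its sample.
pseudo = counts + 1
gmean = np.exp(np.log(pseudo).mean(axis=0))
clr = np.log(pseudo / gmean)

print(tss[:, 0])  # proportions within sample 1 sum to 1
```

A handy sanity check on CLR output: the transformed values within each sample sum to zero, which is what places the data in Euclidean space.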

  9. Which normalization method should I choose?

    There is no consensus guideline regarding which normalization performs best and should be used for all types of datasets. The choice of method depends on the type of analysis to be performed. Users can explore different approaches and visually investigate the clustering patterns (i.e., ordination plots, dendrogram and heatmap) to determine the effects of different normalization methods with regard to the experimental factor of interest. For a detailed discussion of these methods and their performance for different types of analyses, users are referred to the comparative studies by Pereira, MB et al., Jonsson, V et al., Weiss, S et al. and McMurdie, PJ et al.

    The normalized data are used for exploratory data analysis, including ordination, clustering and integrative analysis. Differential abundance testing is also performed on normalized data; however, each of these methods uses its own specific normalization procedure. For example, relative log expression (RLE) normalization is used for DESeq2, and the centered log-ratio transformation is applied for ALDEx2.

  10. When should I opt to rarefy my data?

    Whenever the library size or sequencing depth of your samples differs by more than 10-fold, it is recommended to perform rarefaction before normalizing your data; such an issue cannot be addressed directly by normalization methods. Note that users should also consider removing shallowly sequenced samples, as such gross differences could be due to experimental failure. For more details, please refer to the paper by Weiss, S et al. In ResistoXplorer, users can directly visualize the library sizes (available on the Data Inspection page) or perform rarefaction curve analysis to examine the sequencing depth with respect to the number of ARGs (features) detected per sample, in order to check whether their data need rarefaction.
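The 10-fold check and the rarefaction step itself can be sketched as follows. This is a minimal illustration (toy counts, subsampling without replacement to the smallest library), not ResistoXplorer's internal procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-feature counts for two samples; rarefaction is advised when the
# ratio of the deepest to the shallowest library exceeds ~10x.
counts = {"A": np.array([900, 8000, 1100]), "B": np.array([50, 30, 20])}
lib_sizes = {s: c.sum() for s, c in counts.items()}
depth = min(lib_sizes.values())
print(max(lib_sizes.values()) / depth)  # here 100x, well above the 10x rule

def rarefy(sample_counts, depth, rng):
    """Randomly subsample reads without replacement down to `depth` reads."""
    reads = np.repeat(np.arange(sample_counts.size), sample_counts)
    picked = rng.choice(reads, size=depth, replace=False)
    return np.bincount(picked, minlength=sample_counts.size)

rarefied = {s: rarefy(c, depth, rng) for s, c in counts.items()}
```

After rarefying, every sample has exactly `depth` reads; the shallowest sample is returned unchanged, since all of its reads are kept.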

  11. What is the Unassigned category?

    The Unassigned category contains features with missing (NA) annotations at a given functional level. For instance, if you are looking at the Class level and 20% of the features do not have a Class-level annotation, these are merged together as Unassigned.
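The merging described above can be sketched with a toy annotation table. The feature names and counts are made up for illustration:

```python
import pandas as pd

# Hypothetical Class-level annotation; two features lack an annotation (NA).
annot = pd.DataFrame(
    {"Class": ["Tetracyclines", None, "Beta-lactams", None]},
    index=["tetA", "orf1", "blaTEM", "orf2"],
)
counts = pd.Series([10, 5, 7, 3], index=annot.index)

# Features with a missing annotation are merged into "Unassigned",
# then counts are aggregated per Class-level category.
level = annot["Class"].fillna("Unassigned")
class_counts = counts.groupby(level).sum()
print(class_counts["Unassigned"])  # 8: the two unannotated features combined
```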

  12. How can I detect and deal with outliers?

    Outliers refer to samples that are significantly different from the "majority". A potential outlier will distinguish itself as a sample located far away from the major clusters or trends formed by the remaining data. To discover potential outliers, users can use a variety of summary plots to visualize their data. For instance, a sample with extreme diversity (alpha diversity or ordination) or very low sequencing depth (rarefaction curve analysis or Library Size Overview) may be considered a potential outlier. Outliers may arise for either biological or technical reasons. To deal with outliers, the first step is to check whether those samples were measured properly. In many cases, outliers are the result of operational errors during the analytical process. If those values cannot be corrected (e.g., by normalization procedures), the samples should be removed from your analysis.
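One simple programmatic screen for the low-sequencing-depth case is to flag samples whose library size sits far from the median. This is a generic robust-statistics sketch (the 3-MAD cutoff is an illustrative convention, not a ResistoXplorer setting):

```python
import numpy as np

# Toy library sizes; the last sample is drastically under-sequenced.
lib_sizes = np.array([50000, 52000, 48000, 51000, 900])

# Robust z-score based on the median and the median absolute deviation (MAD);
# 0.6745 rescales the MAD to be comparable to a standard deviation.
median = np.median(lib_sizes)
mad = np.median(np.abs(lib_sizes - median))
robust_z = 0.6745 * (lib_sizes - median) / mad
outliers = np.where(np.abs(robust_z) > 3)[0]
print(outliers)  # index of the shallow sample
```

Such a screen only surfaces candidates; as the text above notes, whether to correct or remove a flagged sample is a judgment call about its biological or technical origin.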
