
In this paper, we propose a new approach to detecting outliers in a set of segmented genomes of the flu virus, a data set with a heterogeneous set of sequences. The approach has the following computational phases: feature extraction, which is a mapping into a feature space; an alignment-free distance measure, which gives the distance between any two segmented genomes; and a mapping into a distance space, in which the resulting set of distance values is analyzed. The approach is implemented using supervised and unsupervised learning modes. The experiments show robustness in detecting outliers of the segmented genome of the flu virus.

Recent years have witnessed a dramatic increase in the amount of genome data submitted to online databases. Analyzing sequence-based datasets is the aim of the sequence analysis and biodata mining research fields. However, engineering solutions for analyzing datasets with heterogeneous features have not yet been achieved. In other words, the datasets under consideration are sets of sequences with different biological functions and different base-composition distributions. The problem under consideration has several computational challenges. The first challenge is to represent the inner information structure of a segmented genome of the flu virus in feature spaces. The second is to define a metric and a metric space to measure the distance between any two information structures that are embedded in a well-defined feature space or composite feature space. The third is to analyze the resulting set of distance values in distance space. The approach that we propose in this paper is an alignment-free approach, which differs from classical alignment approaches in terms of time complexity, selectivity, and sensitivity.

The structure of this paper is as follows. In the next subsection, we review existing approaches to related research problems. In the Methods section, we present the approaches for detecting outliers in segmented genomes of the flu virus. The experiments and results are presented in the Results section. Finally, conclusions and future work are presented in the Discussion section.

The influenza virus mutates at a high rate. It has a negative impact on the human population and, consequently, on public health and the economy. The virus has a segmented genome that encodes 10-11 proteins. The virus is classified into types and subtypes. The variation in the base composition of the surface proteins haemagglutinin (HA) and neuraminidase (NA) indicates the type and the subtype of the influenza virus [

The influenza virus is a negative-stranded RNA virus. It is classified under the family Orthomyxoviridae. The virus has three types: A, B, and C. Influenza virus A is the most variable of these types [

Alignment-free sequence comparison is a developing research direction. It has the potential to solve the sequence proximity problem with lower time complexity than alignment-based analysis [

Any biosequence is linear in time. Therefore, the sequential relation is the most promising feature in biosequences [

One of the distance measures used in multivariate analysis is the angle cosine between two data-vectors [

Han et al. [

Daoud [

Daoud [

The other measure used in measuring the distance between two sequences without using alignment is the Euclidean distance [

As mentioned above, alignment-free analysis is a developing research direction. More than 45 alignment-free tools are available, with different applications in the area of sequence analysis. A summary of comparisons between alignment-free and alignment-based algorithms is given in

In this paper, we focus on detecting outliers in composite data points (e.g., the segmented genome of the flu virus). An outlier is a data point that diverges from the majority of other data points in terms of its measured features [

In this section, we presented the most popular alignment-free techniques and a brief introduction to outlier detection. In the next section, we shall present an outlier detection approach for identifying anomalies in segmented genomes of the flu virus.
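The two alignment-free measures reviewed in this section, the angle cosine and the Euclidean distance between data-vectors, can be illustrated with a minimal sketch. The feature mapping used here (overlapping k-mer counts over the alphabet ACGT) and the function names `kmer_counts`, `angle_cosine`, and `euclidean` are assumptions chosen for illustration; they are not taken from the cited works.

```python
import math
from itertools import product


def kmer_counts(seq, k=2, alphabet="ACGT"):
    """Count overlapping k-mers of seq, in lexicographic k-mer order."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows containing symbols outside the alphabet
            counts[j] += 1
    return counts


def angle_cosine(u, v):
    """Cosine of the angle between two data-vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("angle cosine is undefined for a zero vector")
    return dot / norm


def euclidean(u, v):
    """Euclidean distance between two data-vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Both measures operate on fixed-length vectors, so two sequences of different lengths can be compared without aligning them, e.g. `angle_cosine(kmer_counts(s1), kmer_counts(s2))`.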

In this section, we shall present an approach to identify outliers in a dataset of composite data points. A composite data point is itself a dataset (e.g., a set of data-vectors or a set of sequences). We shall present two directions: (1) supervised and (2) unsupervised learning modes. When training data are available, the approach can be designed as a sequence of computational phases. The first computational phase is to map each composite data point into a feature space by defining a (p×1) feature vector, so that each composite data point is mapped into a set of data vectors. Those data vectors are extracted from heterogeneous sequences; therefore, the base composition of the nucleotide distribution is expected to be heterogeneous. The next phase is to build an information structure for each composite data point. One of the most popular information structures is the variance-covariance structure. The distance between any two information structures can then be measured by defining a distance measure or metric. The metric space is defined as a metric together with a class of matrices, where each matrix represents the information structure of a set of data vectors with unknown distribution.
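The first two phases can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: each segment of a genome is assumed to contribute one (p×1) vector of k-mer counts, and the information structure is taken to be the sample variance-covariance matrix of those vectors. The names `kmer_vector` and `covariance_structure` are hypothetical.

```python
import numpy as np
from itertools import product


def kmer_vector(seq, k=1, alphabet="ACGT"):
    """Map one sequence to a vector of overlapping k-mer counts (p = 4**k)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # ignore symbols outside the alphabet
            vec[j] += 1
    return vec


def covariance_structure(segments, k=1):
    """Variance-covariance matrix of one composite data point.

    segments: the sequences of one segmented genome (assumed: one data
    vector per segment). Returns a p x p matrix.
    """
    X = np.vstack([kmer_vector(s, k) for s in segments])  # n segments x p features
    return np.cov(X, rowvar=False)
```

With this representation, every segmented genome becomes one p×p matrix regardless of how many segments it has or how long they are, which is what allows a single metric to compare any two genomes.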

Daoud [_{ij} (γ_{1})), where Ψ is a class of composite data points, and D_{ij} (γ_{1}) is a metric. D_{ij} (γ_{1}) is defined as follows:

where λ_{1} is the largest generalized eigenvalue (associated with the generalized eigenvector γ_{1}) of the matrix _{n} is random vector that measures the occurrences of
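The quantities λ_{1} and γ_{1} can be computed numerically. The sketch below only shows how a largest generalized eigenvalue and its eigenvector are obtained for a pair of symmetric matrices; which pair of matrices enters the problem and how D_{ij} (γ_{1}) is formed from λ_{1} are not reproduced in the text above, so the pairing (A, B) and the function name `largest_generalized_eigenpair` are assumptions for illustration only.

```python
import numpy as np


def largest_generalized_eigenpair(A, B):
    """Largest lambda and its vector gamma solving A @ gamma = lambda * B @ gamma.

    Assumes A is symmetric and B is symmetric positive definite.
    """
    L = np.linalg.cholesky(B)      # B = L @ L.T
    L_inv = np.linalg.inv(L)
    M = L_inv @ A @ L_inv.T        # equivalent symmetric standard eigenproblem
    M = (M + M.T) / 2.0            # remove round-off asymmetry
    eigenvalues, eigenvectors = np.linalg.eigh(M)  # ascending order
    lam = eigenvalues[-1]
    gamma = L_inv.T @ eigenvectors[:, -1]  # map back to the original coordinates
    return lam, gamma
```

The Cholesky reduction requires B to be positive definite; a singular covariance matrix (for example, when there are fewer data vectors than features) would need regularization or a different solver.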

Measuring the distance between any two variance-covariance matrices Σ^{(i)} and Σ^{(j)} of the same random vector yields distance values. Those distance values represent a random variable. When more than one feature mapping or feature vector is considered (i.e., the data are projected into more than one feature space), those distance values represent a random vector. This random distance vector has a distribution with statistical characteristics, and in this context we define the concept of the distance distribution paradigm (for more details, see Daoud and Kremer [

Now consider the analysis of distance values as a further phase that completes the computational process by detecting the outliers in distance space. For a multivariate random distance vector, different multivariate techniques can be used to detect outliers in distance space. By considering the distance data-vectors as observations of a random vector, we can apply one of the multivariate outlier-detection techniques, which is based on the Mahalanobis distance. In this context, an outlier can be defined as a data-vector with the largest squared Mahalanobis distance. The empirical distribution function of the ordered squared distances and the theoretical distribution function, in this case the χ^{2}-distribution, can be compared to identify outliers with a specific threshold value (quantile) [
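The thresholding step can be sketched as follows. This is a simplified illustration, not the procedure used to produce the reported results: it uses the classical sample mean and covariance rather than robust estimates, it assumes two-dimensional distance vectors so that the χ^{2} quantile has a closed form, and the function names and the default quantile 0.975 are hypothetical.

```python
import numpy as np


def squared_mahalanobis(D):
    """Squared Mahalanobis distance of each row of D from the sample mean."""
    D = np.asarray(D, dtype=float)
    centered = D - D.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(D, rowvar=False))
    # Row-wise quadratic form: centered[i] @ cov_inv @ centered[i]
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


def chi2_quantile_2df(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom.

    With 2 degrees of freedom the CDF is 1 - exp(-x / 2), so it inverts exactly.
    """
    return -2.0 * np.log(1.0 - p)


def flag_outliers(D, p=0.975):
    """Indices of rows whose squared distance exceeds the chi-square quantile."""
    d2 = squared_mahalanobis(D)
    return np.flatnonzero(d2 > chi2_quantile_2df(p))
```

For distance vectors with more than two components, the quantile would come from a χ^{2}-distribution with as many degrees of freedom as there are components, which requires a statistics library rather than the closed form used here.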

The proposed computational approaches are bio-data mining approaches built upon data-vectors extracted from biosequences based on

In this section, we demonstrate the robustness of the proposed approaches with two experiments. In the first experiment, we downloaded 47 and 46 segmented genomes of flu virus A and B, respectively, from the NCBI website [

The output of the proposed approach is illustrated in

In this section, we presented the results of implementing the proposed outlier detection approach. In the next section, we shall present conclusions and future work.

In this paper, we proposed a new approach to detect outliers in segmented genomes of the flu virus. The flu virus has eight segments that encode 10–11 proteins, where each protein has a different biological function and consequently a different nucleotide composition. Those segmented genomes are therefore heterogeneous by nature. The computational challenges are addressed systematically: feature mapping into the feature space, composite feature representation as variance-covariance matrices, definition of a metric space to measure the distance between any two variance-covariance matrices, and finally analysis of the resulting distance values in the distance space. To evaluate the approach, we applied it to two datasets: (1) 47 segmented genomes of flu virus A and (2) 46 segmented genomes of flu virus B. The output of the proposed approach shows the difference between supervised and unsupervised learning, and we identified the weaknesses and strengths of each learning mode.

No potential conflict of interest relevant to this article was reported.

I revised the final version of this paper in Lebach, Saarland State, Germany in 2019.

Sketch of the proposed computational model. The upper part represents the abstract mapping, while the lower part represents the problem under analysis and the expected output.

Algorithm 1: Detecting outliers in sets of biosequences using the supervised approach.

Algorithm 2: Detecting outliers in sets of biosequences using the unsupervised approach.

Output of the proposed supervised outlier detection approach (flu virus A). (A, B) The sub-graphs show the scatter diagram and the distance distribution of the composite data points, respectively. (C, D) The sub-diagrams show outlier detection using different quantiles. The figures were generated with the R package mvoutlier, using the function aq.plot to process the distance data-vectors. In addition, the upper-left subfigure shows the data projected into two-dimensional space using the first and second principal components.

Output of the proposed unsupervised outlier detection approach (flu virus A). (A, B) The sub-graphs show the scatter diagram and the distance distribution of the composite data points, respectively. (C, D) The sub-diagrams show outlier detection using different quantiles. The figures were generated with the R package mvoutlier, using the function aq.plot to process the distance data-vectors. In addition, panel A shows the data projected into two-dimensional space using the first and second principal components.

Output of the proposed supervised outlier detection approach (flu virus B). (A, B) The sub-graphs show the scatter diagram and the distance distribution of the composite data points, respectively. (C, D) The sub-diagrams show outlier detection using different quantiles. The figures were generated with the R package mvoutlier, using the function aq.plot to process the distance data-vectors. In addition, the upper-left subfigure shows the data projected into two-dimensional space using the first and second principal components.

Output of the proposed unsupervised outlier detection approach (flu virus B). (A, B) The sub-graphs show the scatter diagram and the distance distribution of the composite data points, respectively. (C, D) The sub-diagrams show outlier detection using different quantiles. The figures were generated with the R package mvoutlier, using the function aq.plot to process the distance data-vectors. In addition, panel A shows the data projected into two-dimensional space using the first and second principal components.

Comparison between alignment-free and alignment-based techniques [

| Feature | Alignment-based | Alignment-free |
|---|---|---|
| Input data | Sequences | Data-vectors |
| Assumptions about data | Required | Not-required |
| Computational scheme | Dynamic programming | Distance-measures |
| Time complexity | Quadratic | Linear |
| Applications | Sequence comparison | Sequence comparison |
| | Phylogenetic tree | Phylogenetic tree |
| | Function prediction | General mapper |
| | Genome assembly | Genome assembly |
| | Reads correcting errors | Reads error correction |
| | - | Transcript quantification |
| | Metagenomics | Metagenomics |

The output of supervised and unsupervised learning in detecting the outliers of segmented genomes

| Learning approach | Outliers of flu virus A | Outliers of flu virus B |
|---|---|---|
| Supervised | 25, 29 | 28, 6, 26 |
| Unsupervised | 33, 41, 46, 47, 45, 44, 43, 42 | 46, 4, 45, 43, 44, 42 |