Symbolic Data Representation of Multi-Variate Machine Measurement Data to Identify Quasi-Linguistic Patterns with Machine Learning

Philip Nuser

Research output: Thesis › Master's Thesis



This thesis is an exploratory study of unsupervised anomaly detection in multi-variate time-series machine data using methods that originate from natural language processing. The foundation is laid by tokenizing the time-series data, i.e., converting the numeric machine data into symbolic data similar to textual data. Tokenization is realized by discretizing the data and assigning a unique token to each discrete value. The resulting symbolic sequences are then inspected for anomalies with two different approaches. The first method is based on word n-gram language models. An n-gram is a sequence of n words. The counts of the n-grams in the data set are computed and weighted with the term frequency-inverse document frequency (TF-IDF) measure, which expresses the importance of a word to a document, to derive an anomaly score for each sequence. The second approach uses a machine learning model, more specifically a masked language model with a transformer architecture at its core. Random tokens in the input sequences are masked, and the transformer is trained to recreate the numeric sequences. When the model's output for an input sequence that was not used for training diverges from that sequence, anomalies in the sequence are expected. Both anomaly detection methods were implemented and successfully applied to an unlabeled data set originating from instrumented machinery used for ground improvement of building foundations. The results indicate that both approaches are functional in principle and strongly justify continued work in this area, especially on the machine learning model.
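
The tokenization and n-gram scoring steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not specify the binning scheme, the token alphabet, or the exact scoring formula, so the equal-width bins, the letter tokens, the function names (tokenize, ngrams, anomaly_scores), and the score (term-frequency-weighted sum of inverse document frequencies) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(rows, bounds, n_bins=4):
    """Turn multi-variate numeric rows into symbolic tokens.

    Assumption: each channel is discretized into n_bins equal-width bins
    between its (lo, hi) bounds; one letter per channel is concatenated
    into a single token per time step.
    """
    tokens = []
    for row in rows:
        letters = []
        for x, (lo, hi) in zip(row, bounds):
            idx = int((x - lo) / (hi - lo) * n_bins)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            letters.append(chr(ord("a") + idx))
        tokens.append("".join(letters))
    return tokens


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def anomaly_scores(sequences, n=2):
    """Score each sequence by the TF-IDF mass of its n-grams.

    Assumption: the score is sum(tf * idf) over the n-grams of a sequence,
    with tf the relative n-gram frequency in the sequence and
    idf = log(N / df). N-grams that occur in few sequences raise the score.
    """
    docs = [Counter(ngrams(seq, n)) for seq in sequences]
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())  # count each n-gram once per sequence
    scores = []
    for doc in docs:
        total = sum(doc.values())
        if total == 0:  # sequence shorter than n
            scores.append(0.0)
            continue
        scores.append(sum((count / total) * math.log(n_docs / doc_freq[gram])
                          for gram, count in doc.items()))
    return scores


if __name__ == "__main__":
    # Hypothetical two-channel measurements; the third run contains an outlier.
    bounds = [(0.0, 1.0), (0.0, 10.0)]
    runs = [
        [(0.1, 2.0), (0.6, 7.0), (0.1, 2.0), (0.6, 7.0)],
        [(0.2, 1.0), (0.7, 6.0), (0.2, 1.0), (0.7, 6.0)],
        [(0.1, 2.0), (0.6, 7.0), (0.9, 9.9), (0.6, 7.0)],
    ]
    sequences = [tokenize(run, bounds) for run in runs]
    print(anomaly_scores(sequences, n=2))
```

In this toy input the third run receives the highest score, because its bigrams around the outlier token occur in no other run. The masked language model approach is not sketched here, since the abstract gives no architectural details.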
Translated title of the contribution: Symbolisierte Datenrepräsentation in Kombination mit maschinellem Lernen zur Identifizierung von quasi-linguistischen Mustern in multivariaten Maschinendaten
Original language: English
Awarding Institution
  • Montanuniversität
  • O'Leary, Paul, Supervisor (internal)
Award date: 15 Dec 2023
Publication status: Published - 2023

Bibliographical note

no embargo

Keywords


  • anomaly detection
  • multi-variate
  • time-series
  • machine data
  • machine learning
  • artificial intelligence
  • unsupervised learning
  • linguistic methods
  • neural network
  • transformer
  • attention
  • self-attention
  • large language model
  • tokenization
  • discretization
  • statistics
  • masked language model
  • BERT
  • symbolic data
  • n-gram
  • bag-of-words
  • bag-of-ngrams
  • term frequency-inverse document frequency
  • TF-IDF
