Smart diagnostics for complex systems

The high-tech industry is facing increasing customer demand for performance- and availability-based contracts. This pushes the industry to re-invent its traditional diagnostics strategy, moving from a task performed by the service organization towards automated diagnostics by design. At the same time, the industry is challenged by the difficulties arising from the increasing complexity of its systems.

For these reasons, we need to develop efficient techniques for assisted diagnosis of increasingly complex high-tech systems, and new methodologies to automatically predict a system's ability to reliably perform its tasks.

There is a growing need for better and more efficient diagnostics strategies. The complexity of high-tech industrial systems increases every year, which brings challenges not only for their design and interaction but also for their diagnosis.

Additionally, the switch from providing products towards delivering services shifts the cost of ownership from the user to the manufacturer. Current advances in data science bring new opportunities to reduce the time to diagnose.

  1. End-to-end assisted diagnostics methodology
  2. Detecting system anomalies using a digital twin
  3. From physical models to Bayesian networks for diagnostics
  4. From a system problem to the component-level root-cause
  5. Insight into system dynamics with TRACE

1. End-to-end methodology for assisted diagnostics

Despite the boom in data-based algorithms, a diagnostics methodology that handles system complexity in a systematic manner is still missing.

Assisted Diagnostics Methodology

Together with one of our industrial partners, we developed an end-to-end model-based methodology to assist diagnostics of existing problems and to predict upcoming issues within complex production chains. The purpose is to resolve such problems either by an intervention (service action, repairs) or by compensative actions via adaptive control. The assisted aspect of the diagnostics methodology implies automating data analysis and timely detection of issues, as well as guiding experts through a structured and consistent flow of complex diagnostics steps. The result of this assistance is to bring gains in diagnostics efficiency and operational optimization of the high-tech process chains.


The diagnostics flow usually encompasses: a monitoring step, in which equipment and products-in-progress are measured regularly during processing to identify deviations from normal operation or specification; a root-cause analysis step, to identify the root causes of anomalies; and an intervention step, in which service actions or repairs resolve the issues. The basis for conducting the diagnostics is operational data.

These data often come from various (unstructured) sources such as interface logging, machine log files, calibration and test results, and job reports, and need to be continuously combined and analyzed. Our methodology builds on models from two main knowledge sources – experts and data – including techniques such as domain-specific languages to enable domain modelling, data analysis to discover knowledge, and probabilistic analysis models for assisted root-cause analysis.
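As a minimal sketch of what this combination step can look like, the Python fragment below normalizes records from heterogeneous sources into a single time-ordered event stream. All names and record formats here are hypothetical; the methodology itself is not tied to this representation.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Iterable, List

    @dataclass
    class Event:
        """A normalized diagnostic event, merged from heterogeneous sources."""
        timestamp: datetime
        source: str    # e.g. "machine_log", "test_result", "job_report"
        name: str      # event or measurement name
        payload: dict  # source-specific fields

    def merge_sources(*streams: Iterable[Event]) -> List[Event]:
        """Combine per-source event streams into one time-ordered stream."""
        merged = [event for stream in streams for event in stream]
        merged.sort(key=lambda e: e.timestamp)
        return merged

    # Hypothetical adapters would normalize each raw source into Event objects.
    machine_events = [Event(datetime(2020, 1, 1, 9, 0, 3), "machine_log",
                            "axis_error", {"code": 17})]
    test_events = [Event(datetime(2020, 1, 1, 9, 0, 1), "test_result",
                         "overlay_check", {"value_nm": 3.2, "spec_nm": 2.5})]

    for e in merge_sources(machine_events, test_events):
        print(e.timestamp, e.source, e.name)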

The generic nature of the methodology allows its application to other technical areas, such as large-scale distributed Internet-of-Things platforms (e.g., intelligent lighting systems), as well as for other purposes, such as prognostics of end-product performance.

Demonstrator 

The developed demonstrator follows the diagnostics flow presented above and is driven by a probabilistic engine built in a systematic manner from data and expert knowledge.

Please contact us for further details and a showcase of the demonstrator.


2. Detecting system anomalies using a digital twin

Data meets knowledge

Challenges

The behavior of large-scale distributed IoT systems is hard to verify and validate. The reasons include:

  1. The specification is often unclear, ambiguous and incomplete.
  2. It is humanly impossible to reason about the correctness of a system consisting of thousands of components.
  3. It is very hard to observe all related components when trying to solve a problem.

Get your specification right

In the OpenAIS project, ESI's knowledge of and experience with the application of Domain-Specific Languages (DSLs) were used to model the behavior of intelligent IoT lighting systems. We introduced generic IoT DSLs to specify unambiguous behavior models and added validation rules that assist in creating correct models.
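The OpenAIS DSLs themselves are not reproduced here; the sketch below only illustrates the underlying idea with a hypothetical Python stand-in: a declarative behavior model plus validation rules that flag incorrect models.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Rule:
        """A declarative behavior rule: when <trigger>, set <actuator> to <level>."""
        trigger: str   # e.g. "presence_detected", "timeout_5min"
        actuator: str  # actuator group name
        level: int     # light level in percent (0-100)

    @dataclass
    class LightingModel:
        actuators: List[str] = field(default_factory=list)
        rules: List[Rule] = field(default_factory=list)

        def validate(self) -> List[str]:
            """Validation rules that assist in creating correct models."""
            issues = []
            for rule in self.rules:
                if rule.actuator not in self.actuators:
                    issues.append(f"rule refers to unknown actuator '{rule.actuator}'")
                if not 0 <= rule.level <= 100:
                    issues.append(f"level {rule.level} out of range for '{rule.actuator}'")
            return issues

    model = LightingModel(
        actuators=["corridor_east"],
        rules=[Rule("presence_detected", "corridor_east", 80),
               Rule("timeout_5min", "corridor_east", 20),
               Rule("presence_detected", "corridor_west", 80)],  # unknown actuator
    )
    print(model.validate())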


Figure 1: Specification and verification of system behavior using DSLs and digital twins. Operational data (red arrows) and reference data (blue arrows) are used in a root-cause analysis application.

The DSL models that capture the knowledge of the system are used to generate virtual prototypes. One of these virtual prototypes is a simulator that receives virtual sensor events and triggers virtual actuators according to the system specification. It is used for system analyses such as validating the behavior with the customer and studying energy consumption and the effects of message loss and network latencies. The models are adapted until analysis shows that the system behaves as the customer expects.


Figure 2: Comparator output. Blue line: reference actuator data; green/red dots: "correct"/"anomaly" operational actuator data samples; purple dots: operational presence sensor data.

We developed a digital twin that combines knowledge and operational data to monitor the behavior of the IoT lighting system. To achieve this, we transformed the generated virtual prototype into a digital twin by feeding it operational sensor data collected from the physical system. As a result, the digital twin controls its virtual actuators according to the system specification, generating reference actuator data (Figure 1).

Now, both the operational actuator data and the reference data generated by the digital twin are available for digital twin applications to use.

In OpenAIS, we used the digital twin for behavioral anomaly detection in a Root Cause Analysis (RCA) approach. An application was developed to compare the operational actuator data to the reference actuator data (see Figure 2). Based on the results from FMEA and HAZOP studies and the available data, a semi-automatic diagnosis and resolution method was developed and deployed.
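The OpenAIS comparator itself is not reproduced here; the sketch below only illustrates the principle of flagging operational actuator samples that deviate from the digital twin's reference signal. The signal representation, names and tolerance are assumptions.

    from bisect import bisect_right
    from typing import List, Tuple

    Sample = Tuple[float, float]  # (timestamp in s, dim level in percent)

    def reference_level(reference: List[Sample], t: float) -> float:
        """Level the digital twin prescribes at time t (step-wise signal)."""
        i = bisect_right([ts for ts, _ in reference], t) - 1
        return reference[max(i, 0)][1]

    def detect_anomalies(operational: List[Sample],
                         reference: List[Sample],
                         tolerance: float = 5.0) -> List[Sample]:
        """Operational samples deviating more than `tolerance` from the reference."""
        return [(t, level) for t, level in operational
                if abs(level - reference_level(reference, t)) > tolerance]

    # Reference behavior: lights on at t=0 (80%), dimmed at t=60 (20%).
    reference = [(0.0, 80.0), (60.0, 20.0)]
    operational = [(10.0, 79.0), (65.0, 60.0)]  # second sample is anomalous
    print(detect_anomalies(operational, reference))  # -> [(65.0, 60.0)]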

The RCA approach was used in a pilot project on the fifth floor of the “Witte Dame” building in Eindhoven, where an OpenAIS intelligent lighting system with approximately 1,200 components was installed. The RCA approach proved to be very successful, as thousands of behavioral anomalies were detected, diagnosed and resolved quickly.


3. From physical models to Bayesian networks for diagnostics

A methodology for system-level diagnostics

Together with one of our industrial partners, we developed a methodology to assist operators of complex industrial systems in achieving maximum availability at minimum cost.


The methodology has probabilistic reasoning, physical models and operational data as its key components. The operational data from complex electromechanical systems is often incomplete and comes from unstructured sources. We therefore start our methodology with a model-based approach, using physical models to clean and complement the operational data. These models can also be used to predict the degradation, and hence the remaining useful life, of key system components, helping operators to plan maintenance interventions in advance.
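As an illustration of the degradation and remaining-useful-life idea, consider a minimal sketch in which the physical model suggests a roughly linear wear trend: a line is fitted to the cleaned observations and extrapolated to a failure threshold. All names and numbers are hypothetical.

    import numpy as np

    def remaining_useful_life(times: np.ndarray,
                              wear: np.ndarray,
                              threshold: float) -> float:
        """Extrapolate a linear wear trend to the failure threshold.

        Returns the estimated time (same unit as `times`) until the wear
        indicator crosses `threshold`.
        """
        slope, intercept = np.polyfit(times, wear, deg=1)
        if slope <= 0:
            return float("inf")  # no measurable degradation trend
        t_fail = (threshold - intercept) / slope
        return max(t_fail - times[-1], 0.0)

    # Hypothetical wear indicator observed over operating hours.
    hours = np.array([0.0, 100.0, 200.0, 300.0, 400.0])
    wear = np.array([0.02, 0.11, 0.19, 0.31, 0.40])
    print(f"estimated RUL: {remaining_useful_life(hours, wear, threshold=1.0):.0f} h")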

The augmented data is then used for machine learning that determines the parameters of a probabilistic reasoning model. The reasoning model produces a system readiness assessment, extending the condition monitoring capabilities from reasoning at component level towards reasoning at system level. Finally, novel visualizations clarify the uncertainty expressed in the inferred probabilistic predictions, and by fusing the probability fields into a stepwise regression we can predict the long-term development of key system components.

Building a reasoning engine for diagnostics

The probabilistic reasoning model is built following an object-oriented approach. First, we compile a library of sub-models for each type of component in the complex system; we then assemble these sub-models into the complete probabilistic reasoning model. The library is built using expert knowledge and the specifications of the system.


We realize the assembly by fitting the library to the system's schematics using generative techniques. This structured procedure ensures scalability and maintainability of the reasoning model, while limiting the modelling effort required from the experts.
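A minimal sketch of this assembly idea, using the open-source pgmpy library (API as of pgmpy 0.1.x; newer releases rename BayesianNetwork to DiscreteBayesianNetwork): a library sub-model for a single component type is instantiated per component and wired together following a toy schematic. The component types, probabilities and structure are invented for illustration and do not reflect the partner's actual models.

    from pgmpy.models import BayesianNetwork
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.inference import VariableElimination

    def pump_submodel(name):
        """Library entry: health sub-model for a 'pump'-type component."""
        cpd_health = TabularCPD(f"{name}_ok", 2, [[0.05], [0.95]])  # P(broken), P(ok)
        cpd_vibration = TabularCPD(  # observed vibration depends on health
            f"{name}_vibration", 2,
            [[0.2, 0.9],   # P(vibration=low  | broken, ok)
             [0.8, 0.1]],  # P(vibration=high | broken, ok)
            evidence=[f"{name}_ok"], evidence_card=[2])
        return [(f"{name}_ok", f"{name}_vibration")], [cpd_health, cpd_vibration]

    # "Schematic": the system contains two pumps; readiness requires both.
    edges, cpds = [], []
    for name in ["pump_a", "pump_b"]:
        e, c = pump_submodel(name)
        edges += e
        cpds += c
    edges += [("pump_a_ok", "system_ready"), ("pump_b_ok", "system_ready")]
    cpds.append(TabularCPD(
        "system_ready", 2,
        [[1.0, 1.0, 1.0, 0.02],   # P(not ready | pump health combinations)
         [0.0, 0.0, 0.0, 0.98]],  # P(ready)
        evidence=["pump_a_ok", "pump_b_ok"], evidence_card=[2, 2]))

    model = BayesianNetwork(edges)
    model.add_cpds(*cpds)
    assert model.check_model()

    # Infer system readiness given a high-vibration observation on pump A.
    posterior = VariableElimination(model).query(
        variables=["system_ready"], evidence={"pump_a_vibration": 1})
    print(posterior)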

Overall, the proposed methodology for system diagnostics is domain-independent and applicable to a wide variety of complex industrial systems.


4. From a system problem to the component-level root-cause

Challenges and opportunities

Easy data access and utilization throughout industrial processes are fundamental elements of Industry 4.0. High-tech companies encounter opportunities to use the emerging data to improve their system life cycle and to define new business cases by transforming themselves into service-providing companies. At the same time, they face previously unseen challenges, such as a significant increase in system complexity, customization, continuous evolution and diversity of operational environments.

Figure 1: Analysis flow of the system diagnostics demonstrator

Traditional system engineering approaches, which rely heavily on human experience, are limited in addressing these challenges. The emerging data of high-tech systems bring the opportunity to go beyond the capability of human experience and thus enhance system engineering. ESI empowers system engineering by exploiting data insights in a methodological manner to support the innovation of high-tech systems.

Towards the next level of system engineering

Our know-how starts with understanding and translating companies' various business goals into relevant data-driven application areas and requirements. For example, product customization can be well defined based on system usage, which is learnt from operational data. Next, we enhance system engineering with methodologies, realized as prototypes, that integrate the insights discovered from data and the knowledge of the systems into system-level reasoning, as shown in Figure 2. This is achieved by carrying out knowledge engineering to construct domain models, which structurally capture knowledge scattered across documents, engineering code and experts' heads. In the context of data-driven applications, knowledge engineering facilitates the effective analysis of operational data. The additional exploitation of data science techniques allows the discovery of operational models.

The integration of these knowledge-driven and data-driven approaches enables continuous system evolution and operational support by re-using and strengthening company knowledge.

Demonstrator on system diagnostics

We developed a demonstrator to explore and address the challenges of putting the methodology into practice. It sketches the landscape and the integration principles of knowledge-assisted data analysis techniques applied to the variety of data streams available within the high-tech industry.

As an industrial showcase, our demonstrator presents semi-automatic identification of the main root cause of a factory (system-of-systems) performance degradation through a guided, deep-dive analysis down to the level of machine-specific components (in this case a software task), as shown in Figure 1.

Figure 2: ESI's methodology of data and knowledge integration

ESI uses the developed methodologies as a driver to improve company processes and to enrich the competencies needed to embed the results in the organization, thereby facilitating the innovation of product and service development.


5. Insight into system dynamics with TRACE

Insight into system dynamics through execution traces

Understanding, analyzing and ultimately solving performance issues in high-tech systems is a notoriously difficult task. Performance is a cross-cutting concern, often affected by many system components. One simple yet effective way of gaining insight into performance is to capture and study execution traces that involve the most significant components. Execution traces are time-stamped sequences of events, each annotated with information such as the name and type of the event. Although an execution trace is a model of only a single system behavior, much insight can be gained by studying such traces. TRACE is a lightweight tool that enables this. It visualizes an execution trace as a Gantt chart and provides sophisticated, automated analysis techniques tailored to answering typical questions such as ‘What’s the bottleneck?’, ‘Are the throughput and latency requirements met?’ and ‘How can we improve performance?’.


Figure 1: TRACE workflow

The concept of ‘performance by construction’ guarantees performance properties on the left-hand side of the well-known V-model of development. It improves the current state of practice, in which performance properties often emerge late in the development process (on the right-hand side of the V-model) and are thus very expensive to fix if they are not satisfactory. Executable models (virtual prototypes and digital twins) and their analysis play an important role in realizing performance by construction. TRACE contributes because it can visualize and analyze both simulation traces from models and execution traces from real systems, giving engineers greater insight into the performance of their designs in the early phases of development.

Trace visualization and analysis

TRACE is available as an Eclipse plug-in at https://www.esi.nl/solutions/trace. The input is a plain text file describing a set of claims. A claim has a start time and an end time, a part of a resource that is claimed during this time, and a number of user-defined attributes such as the name of the component making the claim. For example: ‘Component A claims 2 cores of a CPU from time 0.020 ms to time 0.040 ms’. This modelling of an execution is motivated by the Y-chart decomposition of systems into an application that is mapped onto a platform (a set of resources).

TRACE and its underlying concepts are easy to learn and domain independent, and applying the tool can bring great benefits. The visualization alone is often very useful, as it displays the parallel activities and their respective durations; the human brain can quickly identify patterns and anomalies in such views. As traces grow in size, however, automated methods are needed to support the analysis. TRACE's analysis support distinguishes it from other Gantt chart viewers.
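TRACE's exact input format is not reproduced here; the sketch below mimics the claim concept in plain Python and derives one simple resource-usage figure, the peak load on a resource. Names are assumptions; the numbers reuse the CPU example above.

    from dataclasses import dataclass, field

    @dataclass
    class Claim:
        """An activity occupying part of a resource over a time interval."""
        start: float   # ms
        end: float     # ms
        resource: str  # e.g. "CPU"
        amount: float  # e.g. number of cores claimed
        attributes: dict = field(default_factory=dict)

    def peak_usage(claims, resource):
        """Maximum simultaneous amount claimed on `resource` (sweep line)."""
        events = []
        for c in claims:
            if c.resource == resource:
                events += [(c.start, c.amount), (c.end, -c.amount)]
        load = peak = 0.0
        for _, delta in sorted(events):  # ends sort before starts at equal times
            load += delta
            peak = max(peak, load)
        return peak

    claims = [
        Claim(0.020, 0.040, "CPU", 2, {"component": "A"}),  # example from the text
        Claim(0.030, 0.050, "CPU", 1, {"component": "B"}),
    ]
    print(peak_usage(claims, "CPU"))  # -> 3.0 cores between 0.030 and 0.040 ms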

TRACE has the following automated analysis techniques:

  • Resource-usage analysis can, for instance, be used to identify highly loaded resources. 
  • Throughput analysis computes the course of the system throughput using a (continuous) sliding-window average over the events in the execution trace (a minimal sketch follows this list).
  • Critical-path analysis shows the bottleneck activities and resources.
  • Anomaly detection can identify irregularities in mostly regular execution traces.
  • Distance analysis compares traces to each other. This can be useful to calibrate a model using a trace from the system that is being modelled.
  • Temporal logic (MTL and STL) can be used to formally specify and verify properties of traces, such as an upper bound on latency or a lower bound on throughput.  
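As an illustration of the throughput analysis mentioned above, here is a minimal sketch of a (continuous) sliding-window average over event timestamps. The window and step sizes are arbitrary examples; TRACE's own implementation may differ.

    from bisect import bisect_left, bisect_right

    def sliding_window_throughput(timestamps, window, step):
        """Events per time unit in a window sliding over a sorted timestamp list.

        Returns (window_center, throughput) pairs.
        """
        if not timestamps:
            return []
        result = []
        t = timestamps[0]
        while t <= timestamps[-1]:
            count = (bisect_right(timestamps, t + window / 2)
                     - bisect_left(timestamps, t - window / 2))
            result.append((t, count / window))
            t += step
        return result

    # Completion times (ms) of processed items; throughput drops near the end.
    events = [1, 2, 3, 4, 5, 6, 9, 12]
    for center, tput in sliding_window_throughput(events, window=4.0, step=2.0):
        print(f"t={center:.0f} ms: {tput:.2f} items/ms")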


Figure 2: A typical visualization of an execution trace as a Gantt chart.


Carmen Bratosin

+31 (0)88 866 54 20 (secretariat)
carmen.bratosin@tno.nl

“Modelling, and in particular discrete-event systems models, have been at the core of my work for the past 15 years: creating design models, discovering them from usage data, analysing them, and so on.”