- Machine Learning
- Constraint Programming
- Logic Programming
Principles and Foundations of Artificial Intelligence (PFAI)
A selection from the Stockholm University publication database
Hybrid feature tweaking
2021. Tony Mattias Lindgren. ICCDE 2021: 2021 7th International Conference on Computing and Data Engineering, 20-26 (Conference)
When using prediction models created from data, it is in certain cases not sufficient for the users to only get a prediction, sometimes accompanied by a probability of the predicted outcome. Instead, a more elaborate answer is required: given the predicted outcome, how can this outcome be changed into a desired outcome, i.e., feature tweaking. In this paper we introduce a novel hybrid method for performing feature tweaking that builds upon Random Forest Similarity Tweaking and utilizes a Constraint Logic Programming solver for the Finite Domain (CLPFD). This hybrid method is compared to using only a CLPFD solver and to a previously known feature tweaking algorithm, Actionable Feature Tweaking. The results show that, compared to the other methods, the hybrid method provides a good balance between distance (between the original example and the tweaked example) and completeness (the number of successfully tweaked examples). Another benefit of the novel method is that the user can specify a prediction threshold for feature tweaking and adjust the weights of features to mimic the real-world cost of changing feature values.
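The core idea of feature tweaking can be sketched independently of the paper's CLPFD machinery. The following is a minimal greedy illustration, not the authors' algorithm: it tries single-feature changes from a hypothetical candidate set and keeps the cheapest one, under user-supplied feature weights, that pushes the model's score over the desired threshold.

```python
def tweak_example(x, score, threshold, candidates, weights):
    """Return the cheapest single-feature change that lifts score(x) to at
    least `threshold`, where cost = weights[j] * |new value - old value|.
    Returns None when no candidate change reaches the threshold."""
    best_cost, best_x = None, None
    for j, values in candidates.items():
        for v in values:
            x_new = list(x)
            x_new[j] = v
            if score(x_new) >= threshold:
                cost = weights[j] * abs(v - x[j])
                if best_cost is None or cost < best_cost:
                    best_cost, best_x = cost, x_new
    return best_x

# toy scoring function standing in for a forest's predicted probability
score = lambda z: 0.5 * z[0] + 0.5 * z[1]
tweaked = tweak_example([0.0, 0.0], score, 0.4,
                        {0: [1.0], 1: [0.5]}, {0: 2.0, 1: 1.0})
```

Here only the change to feature 0 reaches the threshold, so it is chosen despite its higher weight; with both candidates feasible, the weights would decide.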
Prediction of Global Navigation Satellite System Positioning Errors with Guarantees
2021. Alejandro Kuratomi Hernandez, Tony Lindgren, Panagiotis Papapetrou. Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track, 562-578 (Conference)
Intelligent Transportation Systems employ different localization technologies, such as the Global Navigation Satellite System. This system transmits signals between satellites and receiver devices on the ground, which can estimate their position on the earth's surface. The accuracy of this positioning estimate, or the positioning error estimation, is of utmost importance for the efficient and safe operation of autonomous vehicles, which require not only the position estimate but also an estimate of their operation margin. This paper proposes a workflow for positioning error estimation using a random forest regressor along with a post-hoc conformal prediction framework. The latter is calibrated on the random forest out-of-bag samples to transform the obtained positioning error estimates into predicted integrity intervals, which are confidence intervals on the positioning error prediction with at least 99.999% confidence. The performance is measured as the number of ground truth positioning errors inside the predicted integrity intervals. An extensive experimental evaluation is performed on real-world and synthetic data in terms of root mean square error between predicted and ground truth positioning errors. Our solution results in an improvement of 73% compared to earlier research, while providing statistical guarantees on the predictions.
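The interval construction can be illustrated with a small split-conformal sketch. This is a simplification of the paper's out-of-bag-calibrated setup, with toy numbers of our own: absolute calibration residuals serve as nonconformity scores, and a high empirical quantile of them gives the half-width of the integrity interval.

```python
import numpy as np

def conformal_interval(y_cal, y_cal_pred, y_new_pred, confidence):
    """Split-conformal regression interval: nonconformity scores are the
    absolute residuals on a calibration set (e.g. out-of-bag predictions),
    and the ceil((n+1)*confidence)-th smallest score is the half-width."""
    scores = np.sort(np.abs(np.asarray(y_cal, float) -
                            np.asarray(y_cal_pred, float)))
    n = len(scores)
    k = min(n - 1, int(np.ceil((n + 1) * confidence)) - 1)
    half_width = scores[k]
    return y_new_pred - half_width, y_new_pred + half_width

# calibration residuals 1..10; 80% interval around a new prediction of 5.0
lo, hi = conformal_interval(list(range(1, 11)), [0.0] * 10, 5.0,
                            confidence=0.8)
```

Note that at a confidence level as extreme as 99.999%, the quantile index clamps to the largest score unless the calibration set is very large, which is one reason a large pool of out-of-bag samples matters.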
Z-Hist: A Temporal Abstraction of Multivariate Histogram Snapshots
2021. Zed Lee (et al.). Advances in Intelligent Data Analysis XIX, 376-388 (Conference)
Multivariate histogram snapshots are complex data structures that frequently occur in predictive maintenance. Histogram snapshots store large amounts of data in devices with small memory capacity, though it remains a challenge to analyze them effectively. In this paper, we propose Z-Hist, a novel framework for representing and temporally abstracting histogram snapshots by converting them into a set of temporal intervals. This conversion enables the exploitation of frequent arrangement mining techniques for extracting disproportionally frequent patterns of such complex structures. Our experiments on a turbo failure dataset from a truck Original Equipment Manufacturer (OEM) demonstrate a promising use-case of Z-Hist. We also benchmark Z-Hist on six synthetic datasets for studying the relationship between distribution changes over time and disproportionality values.
An Interactive Visual Tool to Enhance Understanding of Random Forest Prediction
2020. Ram B. Gurung, Tony Lindgren, Henrik Boström. Archives of Data Science, Series A 6 (1) (Article)
Random forests are known to provide accurate predictions, but the predictions are not easy to understand. In order to provide support for understanding such predictions, an interactive visual tool has been developed. The tool can be used to manipulate selected features to explore what-if scenarios. It exploits the internal structure of the decision trees in a trained forest model and presents this information as interactive plots and charts. In addition, the tool presents a simple decision rule as an explanation for the prediction. It also presents recommendations for reassignments of feature values of the example that lead to a change of the prediction to a preferred class. An evaluation of the tool was undertaken at a large truck manufacturing company, targeting fault prediction for a selected component in trucks. A set of domain experts was invited to use the tool and provide feedback in post-task interviews. The results of this investigation suggest that the tool indeed may aid in understanding the predictions of random forests, and also allows for gaining new insights.
Evaluation of Dimensionality Reduction Techniques
2020. Michael Mammo, Tony Lindgren. ICCDE 2020, 75-79 (Conference)
One of the commonly observed phenomena in text classification problems is sparsity of the generated feature set. So far, different dimensionality reduction techniques have been developed to reduce feature spaces to a convenient size that a learning algorithm can handle. Among these, Principal Component Analysis (PCA) is one of the well-established techniques, capable of generating an undistorted view of the data. As a result, variants of the algorithm have been developed and applied in several domains, including text mining. However, PCA does not provide backward traceability to the original features once it has projected the initial features onto a new space. It also needs a relatively large computational space, since it uses all features when generating the final features. These drawbacks especially pose a problem in text classification, where high dimensionality and sparsity are common phenomena. This paper presents a modified version of PCA, Principal Feature Analysis (PFA), which enables backward traceability by choosing a subset of optimal features in the original space using the same criteria as PCA, without involving the initial features in the final computation. The proposed technique is tested against benchmark corpora and produces results comparable to PCA while maintaining traceability to the original feature space.
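The selection step can be sketched as follows. This is our simplified reading, not the paper's exact procedure: compute the top principal directions, then rank the original features by how much loading weight they carry across those directions, so that the selected features remain indices in the original space.

```python
import numpy as np

def principal_features(X, k):
    """Select k original features by their squared loadings on the top-k
    principal directions of the centered data matrix X (rows = documents,
    columns = features); returned indices refer to the original features."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    weight = (vt[:k] ** 2).sum(axis=0)     # per-feature weight, sign-free
    return np.argsort(weight)[::-1][:k]

# feature 0 dominates the variance, so it should be selected first
X = np.array([[10.0, 0.0, 0.1], [-10.0, 0.0, -0.1],
              [9.0, 0.5, 0.0], [-9.0, -0.5, 0.0]])
selected = principal_features(X, 1)
```

Squaring the loadings makes the ranking insensitive to the sign ambiguity of the singular vectors.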
Z-Miner: An Efficient Method for Mining Frequent Patterns of Event Intervals
2020. Zed Lee, Tony Lindgren, Panagiotis Papapetrou. KDD '20, 524-534 (Conference)
Mining frequent patterns of event intervals from a large collection of interval sequences is a problem that appears in several application domains. In this paper, we propose Z-Miner, a novel algorithm for solving this problem that addresses the deficiencies of existing competitors by employing two novel data structures: Z-Table, a hierarchical hash-based data structure for time-efficient candidate generation and support count, and Z-Arrangement, a data structure for efficient memory consumption. The proposed algorithm is able to handle patterns with repetitions of the same event label, allowing for gap and error tolerance constraints, as well as keeping track of the exact occurrences of the extracted frequent patterns. Our experimental evaluation on eight real-world and six synthetic datasets demonstrates the superiority of Z-Miner against four state-of-the-art competitors in terms of runtime efficiency and memory footprint.
A Methodology for Prognostics Under the Conditions of Limited Failure Data Availability
2019. Gishan D. Ranasinghe (et al.). IEEE Access 7, 183996-184007 (Article)
When failure data are limited, data-driven prognostics solutions underperform, since the number of failure data samples is insufficient for training prognostics models effectively. In order to address this problem, we present a novel methodology for generating failure data, which allows training datasets to be augmented so that the number of failure data samples is increased. In contrast to existing data generation techniques, which duplicate or randomly generate data, the proposed methodology is capable of generating new and realistic failure data samples. The methodology utilises the conditional generative adversarial network and auxiliary information pertaining to failure modes to control and direct the failure data generation process. The theoretical foundation of the methodology in a non-parametric setting is presented, and we show that it holds in practice using empirical results. The methodology is evaluated in a real-world case study involving the prediction of air purge valve failures in heavy trucks. Two prognostics models are developed using the gradient boosting machine and random forest classifiers. When these models are trained on the augmented training dataset, they outperform the best solution previously proposed in the literature for the case study by a large margin. More specifically, costs due to breakdowns and false alarms are reduced by 44%.
Example-Based Feature Tweaking Using Random Forests
2019. Tony Lindgren (et al.). 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (Conference)
In certain application areas, when using predictive models it is not enough to make an accurate prediction for an example; instead it might be more important to change a prediction from an undesired class into a desired class. In this paper we investigate methods for changing the predictions of examples. To this end, we introduce a novel algorithm for changing predictions of examples and compare this novel method to an existing method and a baseline method. In an empirical evaluation we compare the three methods on a total of 22 datasets. The results show that the novel method and the baseline method can change an example from an undesired class into a desired class in more cases than the competitor method (and in some cases this difference is statistically significant). We also show that the distance, as measured by the Euclidean norm, is higher for the novel and baseline methods (and in some cases this difference is statistically significant) than for the state-of-the-art method. The methods and their proposed changes are also evaluated subjectively in a medical domain, with interesting results.
On Data Driven Organizations and the Necessity of Interpretable Models
2019. Tony Lindgren. Smart Grid and Internet of Things, 121-130 (Conference)
In this paper we investigate data driven organizations in the context of predictive models, and reflect on the need for interpretability of the predictive models in such a context. By investigating a specific use-case, the maintenance offer of a heavy truck manufacturer, we explore the current situation and try to identify areas that need change in order to move from the current situation towards a more data driven and agile maintenance offer. The suggestions for improvements are captured in a proposed data driven framework for this type of business. The aim of the paper is that the suggested framework can inspire and start further discussions and investigations into the best practices for creating a data driven organization, in businesses facing similar challenges as in the presented use-case.
Learning Random Forest from Histogram Data Using Split Specific Axis Rotation
2018. Ram B. Gurung, Tony Lindgren, Henrik Boström. International Journal of Machine Learning and Computing 8 (1), 74-79 (Article)
Machine learning algorithms for data containing histogram variables have not been explored to any major extent. In this paper, an adapted version of the random forest algorithm is proposed to handle variables of this type, assuming an identical structure of the histograms across observations, i.e., the histograms for a variable all use the same number and width of bins. The standard approach of representing bins as separate variables may lead the learning algorithm to overlook the underlying dependencies. In contrast, the proposed algorithm handles each histogram as a unit. When performing split evaluation of a histogram variable during tree growth, a sliding window of fixed size is employed by the proposed algorithm to constrain the sets of bins that are considered together. A small number of all possible sets of bins is randomly selected, and principal component analysis (PCA) is applied locally to all examples in a node. Split evaluation is then performed on each principal component. Results from applying the algorithm to both synthetic and real world data are presented, showing that the proposed algorithm outperforms the standard approach of using random forests together with bins represented as separate variables, with respect to both AUC and accuracy. In addition to introducing the new algorithm, we elaborate on how real world data for predicting NOx sensor failure in heavy duty trucks was prepared, demonstrating that predictive performance can be further improved by adding variables that represent changes of the histograms over time.
Random Rule Sets - Combining Random Covering with the Random Subspace Method
2018. Tony Lindgren. International Journal of Machine Learning and Computing 8 (1), 8-13 (Article)
Ensembles of classifiers have proven to be among the best methods for creating highly accurate prediction models. In this paper we combine the random covering method, which introduces additional diversity when inducing rules with the covering algorithm, with the random subspace selection method, which has been used successfully by, for example, the random forest algorithm. We compare three different covering methods with the random forest algorithm: the first using random subspace selection and random covering; the second using bagging and random subspace selection; and the third using bagging, random subspace selection and random covering. The results show that all three covering algorithms perform better than the random forest algorithm. The covering algorithm using random subspace selection and random covering performs best of all methods. The differences are not significant according to the adjusted p-values, but are for the unadjusted p-values, indicating that the novel method introduced in this paper warrants further attention.
Conformal prediction using random survival forests
2017. Henrik Boström (et al.). 16th IEEE International Conference on Machine Learning and Applications, 812-817 (Conference)
Random survival forests constitute a robust approach to survival modeling, i.e., predicting the probability that an event will occur before or on a given point in time. Similar to most standard predictive models, no guarantee for the prediction error is provided for this model, which instead typically is empirically evaluated. Conformal prediction is a rather recent framework, which allows the error of a model to be determined by a user specified confidence level, something which is achieved by considering set rather than point predictions. The framework, which has been applied to some of the most popular classification and regression techniques, is here for the first time applied to survival modeling, through random survival forests. An empirical investigation is presented where the technique is evaluated on datasets from two real-world applications; predicting component failure in trucks using operational data and predicting survival and treatment of heart failure patients from administrative healthcare data. The experimental results show that the error levels indeed are very close to the provided confidence levels, as guaranteed by the conformal prediction framework, and that the error for predicting each outcome, i.e., event or no-event, can be controlled separately. The latter may, however, lead to less informative predictions, i.e., larger prediction sets, in case the class distribution is heavily imbalanced.
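The class-wise error control described above can be illustrated with a small label-conditional (Mondrian) conformal sketch; the scores and numbers are illustrative, not from the paper.

```python
def prediction_set(cal_scores_by_class, test_scores, significance):
    """Label-conditional conformal sketch: a class enters the prediction set
    when the fraction of its calibration nonconformity scores at least as
    large as the test score exceeds `significance`, so the error for each
    outcome (event / no-event) is controlled separately."""
    included = set()
    for label, cal in cal_scores_by_class.items():
        s = test_scores[label]
        p_value = (sum(1 for c in cal if c >= s) + 1) / (len(cal) + 1)
        if p_value > significance:
            included.add(label)
    return included

cal = {"event": [0.1, 0.2, 0.3, 0.9], "no-event": [0.1, 0.1, 0.2, 0.2]}
pred = prediction_set(cal, {"event": 0.25, "no-event": 0.95},
                      significance=0.25)
```

With a heavily imbalanced class distribution, one class's p-values rarely fall below the significance level, which is exactly how the less informative (larger) prediction sets mentioned in the abstract arise.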
Planning Flexible Maintenance for Heavy Trucks using Machine Learning Models, Constraint Programming, and Route Optimization
2017. Jonas Biteus, Tony Lindgren. SAE International Journal of Materials & Manufacturing 10 (3), 306-315 (Article)
Maintenance planning of trucks at Scania has previously been done using static cyclic plans with fixed sets of maintenance tasks, determined by mileage, calendar time, and some data driven physical models. Flexible maintenance has improved the maintenance program with the addition of general data driven expert rules and the ability to move sub-sets of maintenance tasks between maintenance occasions. Meanwhile, successful modelling with machine learning on big data, automatic planning using constraint programming, and route optimization hint at the ability to achieve even higher fleet utilization by further improvements of the flexible maintenance. The maintenance program has therefore been partitioned into its smallest parts and formulated as individual constraint rules. The overall goal is to maximize the utilization of a fleet, i.e., maximize the ability to perform transport assignments, with respect to maintenance. A sub-goal is to minimize the costs for vehicle breakdowns and for maintenance actions. The maintenance planner takes as input customer preferences and maintenance task deadlines, where the existing expert rule for the component has been replaced by a predictive model. Using machine learning, operational data have been used to train a predictive random forest model that can estimate the probability that a vehicle will have a breakdown given its operational data as input. The route optimization takes predicted vehicle health into consideration when optimizing routes and assignment allocations. The random forest model satisfactorily predicts failures, the maintenance planner successfully computes consistent and good maintenance plans, and the route optimizer gives optimal routes within tens of seconds of operation time. The model, the maintenance planner, and the route optimizer have been integrated into a demonstrator able to highlight the usability and feasibility of the suggested approach.
Predicting NOx sensor failure in heavy duty trucks using histogram-based random forests
2017. Ram B. Gurung, Tony Lindgren, Henrik Boström. International Journal of Prognostics and Health Management 8 (1) (Article)
Being able to accurately predict the impending failures of truck components is often associated with significant cost savings, customer satisfaction and flexibility in maintenance service plans. However, because of the diversity in the way trucks are typically configured and used under different conditions, the creation of accurate prediction models is not an easy task. This paper describes an effort in creating such a prediction model for the NOx sensor, i.e., a component measuring the emitted level of nitrogen oxide in the exhaust of the engine. This component was chosen because it is vital for the truck to function properly, while at the same time being very fragile and costly to repair. As input to the model, technical specifications of trucks and their operational data are used. The process of collecting the data and making it ready for training the model via a slightly modified random forest learning algorithm is described, along with various challenges encountered during this process. The operational data consist of features represented as histograms, posing an additional challenge for the data analysis task. In the study, a modified version of the random forest algorithm is employed, which exploits the fact that the individual bins in the histograms are related, in contrast to the standard approach that would consider the bins as independent features. Experiments are conducted using the updated random forest algorithm, and they clearly show that the modified version is indeed beneficial when compared to the standard random forest algorithm. The performance of the resulting prediction model for the NOx sensor is promising and may be adopted for the benefit of operators of heavy trucks.
Randomized Separate and Conquer Rule induction
2017. Tony Lindgren. Proceedings of the International Conference on Compute and Data Analysis, 207-214 (Conference)
Rule learning comes in many forms; here we investigate a modified version of Separate and Conquer (SAC) learning to see if it improves the predictive performance of the induced predictive models. Our modified version of SAC has a hyperparameter which specifies the number of covered examples that should not be removed during induction. This selection is done at random, and as a consequence the SAC algorithm will produce more, and more diverse, rules, depending on the hyperparameter setting. The modified algorithm has been implemented both in an unordered single rule set setting and in an ensemble rule set setting. Both of these settings have been evaluated empirically on a number of datasets. The results show that in the single rule set setting, the modified version significantly improves the predictive performance, at the cost of more rules, which was expected. In the ensemble setting, the combined method of bagging and the modified SAC algorithm did not perform as well as expected, while using only the modified SAC algorithm in the ensemble setting performed better than expected.
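The modified covering loop can be sketched as follows. This is an illustrative reading, with names of our own choosing: after each rule is learned, covered examples are retained with some probability instead of always being removed, so later rules may re-cover them.

```python
import random

def randomized_sac(examples, learn_rule, keep_fraction, rng, max_rules=100):
    """Separate-and-conquer with random retention: each covered example is
    kept (not removed) with probability `keep_fraction`, which with
    keep_fraction = 0 reduces to standard SAC; max_rules guards against
    non-termination when many examples are retained."""
    rules, remaining = [], list(examples)
    while remaining and len(rules) < max_rules:
        rule = learn_rule(remaining)
        if not any(rule(e) for e in remaining):
            break
        rules.append(rule)
        remaining = [e for e in remaining
                     if not rule(e) or rng.random() < keep_fraction]
    return rules

# toy learner: each rule covers exactly the smallest remaining example, so
# with keep_fraction = 0 this behaves as ordinary separate-and-conquer
learn = lambda ex: (lambda e, t=min(ex): e == t)
rules = randomized_sac([3, 1, 4, 2], learn, keep_fraction=0.0,
                       rng=random.Random(0))
```

Raising `keep_fraction` above zero is what produces the larger, more diverse rule sets the abstract describes.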
Indexing Rules in Rule Sets for Fast Classification
2016. Tony Lindgren. Proceedings of the International Conference on Artificial Intelligence and Robotics and the International Conference on Automation, Control and Robotics Engineering (Conference)
Using sets of rules for classification of examples usually involves checking a number of conditions to see if they hold or not. If the rule set is large, the time to make the classification can be lengthy. In this paper we propose an indexing algorithm to decrease the classification time when dealing with large rule sets. Unordered rule sets have a high time complexity when conducting classification; we hence conduct experiments comparing our novel indexing algorithm with the standard way of classifying ensembles of unordered rule sets. The results of the experiment show decreased classification times for the novel method, ranging from 0.6 to 0.8 of that of the standard approach, averaged over all experimental datasets. This time gain is obtained while retaining an accuracy ranging from 0.84 to 0.99 with regard to the standard classification method. The index bit size used with the indexing algorithm influences both the classification accuracy and the time needed for conducting the classification task.
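The general idea of indexing rules for faster classification can be illustrated with a simple interval-to-bin index. This is not the paper's bit-index; it is a minimal sketch for a single numeric feature, with rules represented as (low, high) intervals.

```python
def build_index(rules, bin_edges):
    """index[b] = ids of rules whose interval overlaps the b-th bin, so at
    classification time only these candidates need full condition checks."""
    index = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        index.append({rid for rid, (rlo, rhi) in enumerate(rules)
                      if rlo < hi and lo < rhi})
    return index

def candidate_rules(x, index, bin_edges):
    """Locate x's bin and return only the rules indexed there."""
    b = min(len(index) - 1, sum(1 for e in bin_edges[1:-1] if x >= e))
    return index[b]

rules = [(0.0, 5.0), (5.0, 10.0)]          # two single-feature interval rules
index = build_index(rules, [0.0, 5.0, 10.0])
cands = candidate_rules(3.0, index, [0.0, 5.0, 10.0])
```

The number of bins plays the role the abstract assigns to the index bit size: coarser bins mean more candidate rules per lookup (slower, but exact), while a real implementation trades accuracy for speed by skipping the final condition checks.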
Learning Decision Trees from Histogram Data Using Multiple Subsets of Bins
2016. Ram B. Gurung, Tony Lindgren, Henrik Boström. Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, 430-435 (Conference)
The standard approach of learning decision trees from histogram data is to treat the bins as independent variables. However, as the underlying dependencies among the bins might not be completely exploited by this approach, an algorithm has been proposed for learning decision trees from histogram data by considering all bins simultaneously while partitioning examples at each node of the tree. Although the algorithm has been demonstrated to improve predictive performance, its computational complexity has turned out to be a major bottleneck, in particular for histograms with a large number of bins. In this paper, we propose instead a sliding window approach to select subsets of the bins to be considered simultaneously while partitioning examples. This significantly reduces the number of possible splits to consider, allowing for substantially larger histograms to be handled. We also propose to evaluate the original bins independently, in addition to evaluating the subsets of bins when performing splits. This ensures that the information obtained by treating bins simultaneously is an additional gain compared to what is considered by the standard approach. Results of experiments on applying the new algorithm to both synthetic and real world datasets demonstrate positive results in terms of predictive performance without excessive computational cost.
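The windowing step can be sketched as follows. This shows only the sliding-window projection idea, with a PCA projection per window as in the related histogram work, not the paper's full split-evaluation procedure.

```python
import numpy as np

def window_projections(histograms, window):
    """For each contiguous window of `window` bins, center the window's
    columns and project the examples onto the window's first principal
    component; each projection is a 1-D candidate split variable."""
    n_bins = histograms.shape[1]
    projections = []
    for start in range(n_bins - window + 1):
        block = histograms[:, start:start + window]
        centered = block - block.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        projections.append(centered @ vt[0])
    return projections

# 4 examples with 5-bin histograms -> 3 windows of width 3
H = np.array([[1.0, 0, 0, 0, 0], [0, 1.0, 0, 0, 0],
              [0, 0, 1.0, 0, 0], [0, 0, 0, 1.0, 0]])
projs = window_projections(H, window=3)
```

The point of the window is visible in the counts: with b bins and window size w there are only b - w + 1 windows to evaluate, instead of a number of bin subsets that grows exponentially in b.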
Open government ideologies in post-soviet countries
2016. Karin Hansson (et al.). International Journal of Electronic Governance 8 (3), 244-264 (Article)
Most research in areas like e-government, e-participation and open government assumes a democratic norm. The open government (OG) concept is commonly based on a general liberal and deliberative ideology emphasising transparency, access, participation and collaboration, but where also innovation and accountability are promoted. In this paper, we outline a terminology and suggest a method for investigating the concept more systematically in different policy documents, with a special emphasis on post-soviet countries. The results show that the main focus in this region's OG policy documents is on freedom of information and accountability, and to a lesser extent on collaboration, while other aspects, such as diversity and innovation, are more rarely mentioned, if at all.
Learning Decision Trees from Histogram Data
2015. Ram B. Gurung, Tony Lindgren, Henrik Boström. Proceedings of the 2015 International Conference on Data Mining, 139-145 (Conference)
When applying learning algorithms to histogram data, bins of such variables are normally treated as separate independent variables. However, this may lead to a loss of information as the underlying dependencies may not be fully exploited. In this paper, we adapt the standard decision tree learning algorithm to handle histogram data by proposing a novel method for partitioning examples using binned variables. Results from employing the algorithm to both synthetic and real-world data sets demonstrate that exploiting dependencies in histogram data may have positive effects on both predictive performance and model size, as measured by number of nodes in the decision tree. These gains are however associated with an increased computational cost and more complex split conditions. To address the former issue, an approximate method is proposed, which speeds up the learning process substantially while retaining the predictive performance.
Model Based Sampling - Fitting an Ensemble of Models into a Single Model
2015. Tony Lindgren. Proceedings of 2015 International Conference on Computational Science and Computational Intelligence, 186-191 (Conference)
Large ensembles of classifiers usually outperform single classifiers. Unfortunately, ensembles have two major drawbacks compared to single classifiers: interpretability and classification time. By using the Combined Multiple Models (CMM) framework for compressing an ensemble of classifiers into a single classifier, the problems associated with ensembles can be avoided while retaining almost the same classification power as the original ensemble. One open question when using CMM concerns how to generate the values that constitute a synthetic example. In this paper we present a novel method for generating synthetic examples by utilizing the structure of the ensemble. This novel method is compared with other methods for generating synthetic examples using the CMM framework. From the comparison it is concluded that the novel method outperforms the other methods.
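The generic CMM loop can be sketched as follows. The paper's contribution is the generator that exploits the ensemble's structure; this sketch only shows the surrounding framework, with a hypothetical uniform sampler standing in for that generator.

```python
import random

def cmm_synthetic(ensemble, sample_example, n, rng):
    """CMM-style sketch: generate synthetic examples and label each one with
    the ensemble's majority vote; a single interpretable model can then be
    trained on the resulting labelled set."""
    labelled = []
    for _ in range(n):
        x = sample_example(rng)
        votes = [model(x) for model in ensemble]
        label = max(set(votes), key=votes.count)   # majority vote
        labelled.append((x, label))
    return labelled

# three toy classifiers; two of them agree, so the majority label wins
ensemble = [lambda x: x > 0, lambda x: x > 0, lambda x: x < 0]
data = cmm_synthetic(ensemble, lambda rng: rng.uniform(0.5, 1.5), 5,
                     random.Random(0))
```

The quality of the compressed model hinges on `sample_example` producing points in the regions the ensemble actually discriminates, which is why the paper derives the generator from the ensemble's structure rather than sampling blindly.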
An Open Government Index
2014. Tony Lindgren (et al.). DSV writers hut 2014 (Conference)
Most research in areas like e-government, e-participation and open government assumes a democratic norm. The concept of open government, recently promoted by, e.g., the Obama administration and the European Commission, is to a large extent based on a general liberal and deliberative ideology emphasizing transparency, participation and collaboration. The concept has also become of interest for states like China and Singapore. In this position paper we outline how to study the concept under different political discourses and suggest an Open Government Index that can be used to analyze the concept of open government under various settings.
Expert Guided Adaptive Maintenance
2014. Tony Lindgren, Jonas Biteus. European Conference of the Prognostics and Health Management Society (Conference)
The heavy truck industry is a highly competitive business field; traditionally, maintenance plans for heavy trucks are static and not subject to change. The advent of affordable telematics solutions has created a new venue for services that use information from the truck in operation. Such services could, for example, aim at improving the maintenance offer by taking into account information on how a truck has been utilized, to dynamically adjust maintenance to align with the truck's actual need. These types of maintenance services are often referred to as condition based maintenance (CBM) and, more recently, Integrated Vehicle Health Management (IVHM). In this paper we explain how we at Scania developed an expert system for adapting maintenance intervals depending on operational data from trucks. The expert system is aimed at handling components which maintenance experts have knowledge about but do not find worth the effort of creating a correct physical wear-model for. We developed a systematic way for maintenance experts to express how operational data should influence the maintenance intervals. The rules in the expert system are therefore limited in what they can express, and as such our presented system differs from other expert systems in general. In a comparison between our expert system and another general expert system framework, the expert system we constructed outperforms the general expert framework using our limited type of rules.
Improving the Maintenance Planning of Heavy Trucks using Constraint Programming
2013. Tony Lindgren, Håkan Warnquist, Martin Eineborg. ModRef 2013: The Twelfth International Workshop on Constraint Modelling and Reformulation, 74-90 (Conference)
Maintenance planning of heavy trucks at Scania is presently done using static cyclic plans where each maintenance occasion contains a fixed set of components. Using vehicle operational data gained from on-board sensors we will be able to predict at which intervals each component needs to be maintained. However, dynamic planning is needed to take this new knowledge into account. Another benefit of using dynamic planning is that vehicle owners can influence maintenance plans with regard to their business. For this reason we have implemented a prototype of an automated maintenance planner based on constraint programming techniques. The planner has successfully been tested on vehicles belonging to Scania's internal haulage contractor. In this paper we describe the planner and what we have learned using and developing it, as well as ongoing work on how the planner will be developed further.
Troubleshooting ECU Programmed by Bodybuilders
2012. Tony Lindgren. 2012 International Conference on Connected Vehicles and Expo ICCVE 2012, 231-236 (Conference)
Having an Electronic Control Unit (ECU) which is programmable by external parties puts new requirements on troubleshooting. In this paper we describe how we solved the problem of both troubleshooting additional equipment added by bodybuilders and facilitating their need to use signals from our vehicles in an easy way in order to interact with their additional equipment. We look at bodybuilders' additional equipment for heavy trucks, but our technique for troubleshooting should be equally relevant for other applications with similar conditions.