epyt_flow.data.benchmarks

epyt_flow.data.benchmarks.leakdb

LeakDB (Leakage Diagnosis Benchmark) by Vrachimis, S. G., Kyriakou, M. S., Eliades, D. G., and Polycarpou, M. M. (2018) is a realistic leakage data set for water distribution networks. It comprises 1000 artificially created but realistic leakage scenarios on different water distribution networks under varying conditions.

See https://github.com/KIOS-Research/LeakDB/ for details.

This module provides a function for loading the original LeakDB data set (load_data()), as well as functions for loading the scenarios (load_scenarios()) and the pre-generated SCADA data (load_scada_data()). The official scoring/evaluation is implemented in compute_evaluation_score() – i.e. its results can be directly compared to those reported in the official paper.

epyt_flow.data.benchmarks.leakdb.compute_evaluation_score(scenarios_id: list[int], use_net1: bool, y_pred_labels_per_scenario: list[numpy.ndarray]) → dict[source]

Evaluates the predictions (leakage detection) for a list of given scenarios.

Parameters:

scenarios_id (list[int]) – List of scenario IDs to be evaluated – there are 1000 scenarios in total.
use_net1 (bool) – If True, Net1 LeakDB will be used for evaluation, otherwise the Hanoi LeakDB will be used.
y_pred_labels_per_scenario (list[numpy.ndarray]) – Predicted binary labels (over time) for each scenario in scenarios_id.

Returns:

Dictionary containing the F1-score, true positive rate, true negative rate, and early detection score.

Return type:

dict
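A minimal sketch of how the inputs might be prepared: one binary NumPy array per scenario ID, in the same order as scenarios_id. The helper names, the threshold, the number of time steps, and the random residuals are illustrative assumptions and not part of EPyT-Flow. The call to compute_evaluation_score() is wrapped in a function with a local import, so the sketch itself runs without downloading any data.

```python
import numpy as np


def threshold_residuals(residuals: np.ndarray, threshold: float) -> np.ndarray:
    """Turn a 1-D residual signal into binary leak labels (1 = leak predicted)."""
    return (np.abs(residuals) > threshold).astype(int)


def evaluate_on_leakdb(scenarios_id, y_pred_labels_per_scenario, use_net1=True):
    """Call the documented scoring function (needs EPyT-Flow and the LeakDB data)."""
    # Imported lazily so that this sketch can be executed without EPyT-Flow.
    from epyt_flow.data.benchmarks.leakdb import compute_evaluation_score

    return compute_evaluation_score(scenarios_id, use_net1,
                                    y_pred_labels_per_scenario)


# Hypothetical detector residuals: one array per scenario. The length of 100
# is a placeholder; real predictions must cover the scenario's time steps.
rng = np.random.default_rng(0)
scenarios_id = [1, 2, 3]
residuals_per_scenario = [rng.normal(size=100) for _ in scenarios_id]

y_pred_labels_per_scenario = [threshold_residuals(r, threshold=2.0)
                              for r in residuals_per_scenario]
```

With EPyT-Flow installed, evaluate_on_leakdb(scenarios_id, y_pred_labels_per_scenario) would then return the dictionary of scores described above.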

epyt_flow.data.benchmarks.leakdb.load_data(scenarios_id: list[int], use_net1: bool, download_dir: str | None = None, return_X_y: bool = False, return_features_desc: bool = False, return_leak_locations: bool = False, verbose: bool = True) → dict[source]

Loads the original LeakDB benchmark data set.

Warning

All scenarios together form a huge data set – approx. 8GB for Net1 and 25GB for Hanoi. Downloading and loading might take some time! Also, a sufficient amount of free disk space is required.

Parameters:

scenarios_id (list[int]) – List of scenario IDs to be loaded – there are 1000 scenarios in total.
use_net1 (bool) – If True, Net1 LeakDB will be loaded, otherwise the Hanoi LeakDB will be loaded.
download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data is returned together with the labels (presence of a leakage) as two Numpy arrays, otherwise, the data is returned as Pandas data frames.

The default is False.
return_features_desc (bool, optional) –
If True and if return_X_y is True, the returned dictionary contains the features’ descriptions (i.e. names) under the key “features_desc”.

The default is False.
return_leak_locations (bool) –
If True and if return_X_y is True, the leak locations are returned as well – as an instance of scipy.sparse.bsr_array.

The default is False.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

Dictionary containing the scenario data sets. Data of each requested scenario can be accessed by using the scenario ID as a key.

Return type:

dict

epyt_flow.data.benchmarks.leakdb.load_scada_data(scenarios_id: list[int], use_net1: bool = True, download_dir: str | None = None, return_X_y: bool = False, return_leak_locations: bool = False, verbose: bool = True) → list[ScadaData] | list[tuple[numpy.ndarray, numpy.ndarray]][source]

Loads the SCADA data of the simulated LeakDB benchmark scenarios – see load_scenarios().

Note

Note that due to the randomness in the demand creation as well as in the model uncertainties, the SCADA data differs from the original data set which can be loaded by calling load_data(). However, the leakages (i.e. location and profile) are consistent with the original data set.

Parameters:

scenarios_id (list[int]) – List of scenario IDs to be loaded – there are 1000 scenarios in total.
use_net1 (bool, optional) –
If True, Net1 LeakDB will be loaded, otherwise the Hanoi LeakDB will be loaded.

The default is True.
download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data is returned together with the labels (presence of a leakage) as two Numpy arrays, otherwise, the data is returned as ScadaData instances.

The default is False.
return_leak_locations (bool) –
If True, the leak locations are returned as well – as an instance of scipy.sparse.bsr_array.

The default is False.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

The simulated benchmark scenarios as either a list of ScadaData instances or as a list of (X, y) Numpy arrays. If ‘return_leak_locations’ is True, the leak locations are included as an instance of scipy.sparse.bsr_array as well.

Return type:

list[ScadaData] or list[tuple[numpy.ndarray, numpy.ndarray]]

epyt_flow.data.benchmarks.leakdb.load_scenarios(scenarios_id: list[int], use_net1: bool = True, download_dir: str | None = None, verbose: bool = True) → list[ScenarioConfig][source]

Creates and returns the LeakDB scenarios – they can be either modified or passed directly to the simulator ScenarioSimulator.

Note

Note that due to the randomness in the demand creation as well as in the model uncertainties, the simulation results will differ between runs, and will also differ from the original data set (see load_data()). However, the leakages (i.e. location and profile) will always be the same and consistent with the original data set.

Parameters:

scenarios_id (list[int]) – List of scenario IDs to be loaded – there are 1000 scenarios in total.
use_net1 (bool, optional) –
If True, Net1 network will be used, otherwise the Hanoi network will be used.

The default is True.
download_dir (str, optional) –
Path to the Net1.inp or Hanoi.inp file – if None, the temp folder will be used. If the path does not exist, the .inp file will be downloaded to the given path.

The default is None.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

LeakDB scenarios.

Return type:

list[ScenarioConfig]
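A hedged sketch of the workflow described above: load a single scenario and hand it to the simulator. The import path of ScenarioSimulator, its scenario_config argument, its use as a context manager, and the run_simulation() method are assumptions not stated on this page and may differ between EPyT-Flow versions. The function name is hypothetical. Imports are local, so defining the function does not require EPyT-Flow.

```python
def simulate_leakdb_scenario(scenario_id: int = 1, use_net1: bool = True):
    """Load one LeakDB scenario and simulate it (needs EPyT-Flow installed)."""
    # Local imports keep this sketch importable without EPyT-Flow.
    from epyt_flow.data.benchmarks.leakdb import load_scenarios
    from epyt_flow.simulation import ScenarioSimulator  # assumed import path

    # load_scenarios() returns a list with one ScenarioConfig per requested ID.
    (config,) = load_scenarios(scenarios_id=[scenario_id], use_net1=use_net1)

    # Assumption: the simulator accepts the configuration via `scenario_config`,
    # works as a context manager, and exposes `run_simulation()`.
    with ScenarioSimulator(scenario_config=config) as sim:
        return sim.run_simulation()
```

Because of the randomness mentioned in the note above, two calls with the same scenario ID are not expected to produce identical SCADA data.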

epyt_flow.data.benchmarks.leakdb.true_negative_rate(y_pred: numpy.ndarray, y: numpy.ndarray) → float[source]

Computes the true negative rate (also called specificity).

Parameters:

y_pred (numpy.ndarray) – Predicted labels.
y (numpy.ndarray) – Ground truth labels.

Returns:

True negative rate.

Return type:

float
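For reference, the true negative rate is TN / (TN + FP). The following is a minimal NumPy sketch of that definition, not the library's implementation. The function name and the handling of inputs without any negative samples are assumptions.

```python
import numpy as np


def true_negative_rate_sketch(y_pred: np.ndarray, y: np.ndarray) -> float:
    """Fraction of truly negative samples that were predicted as negative."""
    y_pred = np.asarray(y_pred).astype(bool)
    y = np.asarray(y).astype(bool)

    tn = np.sum(~y_pred & ~y)  # predicted negative, actually negative
    fp = np.sum(y_pred & ~y)   # predicted positive, actually negative

    if tn + fp == 0:
        # Assumption: the rate is treated as undefined without negative samples.
        return float("nan")
    return float(tn / (tn + fp))


y = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 1, 0])
tnr = true_negative_rate_sketch(y_pred, y)  # 2 of the 3 negatives are correct
```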

epyt_flow.data.benchmarks.battledim

The Battle of the Leakage Detection and Isolation Methods (BattLeDIM) 2020, organized by S. G. Vrachimis, D. G. Eliades, R. Taormina, Z. Kapelan, A. Ostfeld, S. Liu, M. Kyriakou, P. Pavlou, M. Qiu, and M. M. Polycarpou, as part of the 2nd International CCWI/WDSA Joint Conference in Beijing, China, aims at objectively comparing the performance of methods for the detection and localization of leakage events, relying on SCADA measurements of flow and pressure sensors installed within water distribution networks.

See https://github.com/KIOS-Research/BattLeDIM for details.

This module provides a function for loading the original BattLeDIM data set (load_data()), as well as functions for loading the scenario (load_scenario()) and the pre-generated SCADA data (load_scada_data()). The official scoring/evaluation is implemented in compute_evaluation_score() – i.e. its results can be directly compared to the official leaderboard results. Besides this, the user can choose to evaluate predictions using any other metric from metrics.

epyt_flow.data.benchmarks.battledim.compute_evaluation_score(y_leak_locations_pred: list[tuple[str, int]], test_scenario: bool, verbose: bool = True) → dict[source]

Evaluates the predictions (i.e. start time and location of leakages) as it was done in the BattLeDIM competition – i.e. the output of this function can be directly compared to the official leaderboard results.

Parameters:

y_leak_locations_pred (list[tuple[str, int]]) – Predictions of location (link/pipe ID) and start time (in seconds since simulation start) of leakages.
test_scenario (bool) – True if the given predictions are made for the test scenario, False otherwise.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

Dictionary containing the true positive rate, true positives, false positives, false negatives, and total monetary (Euro) savings (only available if test_scenario is True).

Return type:

dict
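A small sketch of how y_leak_locations_pred might be assembled when a detector reports wall-clock timestamps. The helper name, the timestamp format, the assumed simulation start, and the pipe IDs are hypothetical and only illustrate the (link/pipe ID, seconds since simulation start) format.

```python
from datetime import datetime


def to_seconds_since_start(timestamp: str, simulation_start: str) -> int:
    """Convert a 'YYYY-MM-DD HH:MM' timestamp to seconds since simulation start."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(timestamp, fmt) - datetime.strptime(simulation_start, fmt)
    return int(delta.total_seconds())


# Assumption: the scenario starts at this time; the pipe IDs are made up.
simulation_start = "2019-01-01 00:00"
detections = [("p123", "2019-01-02 06:30"), ("p456", "2019-03-15 12:00")]

# List of (link/pipe ID, start time in seconds), as expected by the parameter.
y_leak_locations_pred = [(pipe_id, to_seconds_since_start(t, simulation_start))
                         for pipe_id, t in detections]
```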

epyt_flow.data.benchmarks.battledim.load_data(return_test_scenario: bool, download_dir: str | None = None, return_X_y: bool = False, return_features_desc: bool = False, return_leak_locations: bool = False, verbose: bool = True) → pandas.DataFrame | Any[source]

Loads the original BattLeDIM benchmark data set. Note that the data set exists in two different versions – a training version and an evaluation/test version.

Parameters:

return_test_scenario (bool) – If True, the evaluation/test data set is returned, otherwise the historical (i.e. training) data set is returned.
download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data is returned together with the labels (presence of a leakage) as two Numpy arrays, otherwise, the data is returned as a Pandas data frame.

The default is False.
return_features_desc (bool, optional) –
If True and if return_X_y is True, the returned dictionary contains the features’ descriptions (i.e. names) under the key “features_desc”.

The default is False.
return_leak_locations (bool) –
If True, the leak locations are returned as well – as an instance of scipy.sparse.bsr_array.

The default is False.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

Benchmark data set.

Return type:

Either a pandas.DataFrame instance or a tuple of Numpy arrays.

epyt_flow.data.benchmarks.battledim.load_scada_data(return_test_scenario: bool, download_dir: str | None = None, return_X_y: bool = False, return_leak_locations: bool = False, verbose: bool = True) → list[ScadaData | Any][source]

Loads the SCADA data of the simulated BattLeDIM benchmark scenario – note that due to randomness, these differ from the original data set which can be loaded by calling load_data().

Warning

A large file (approx. 4GB) will be downloaded and loaded into memory – this might take some time.

Parameters:

return_test_scenario (bool) – If True, the evaluation/test scenario is returned, otherwise the historical (i.e. training) scenario is returned.
download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data is returned together with the labels (presence of a leakage) as two Numpy arrays, otherwise, the data is returned as a ScadaData instance.

The default is False.
return_leak_locations (bool) –
If True, the leak locations are returned as well – as an instance of scipy.sparse.bsr_array.

The default is False.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

The simulated benchmark scenario as either a ScadaData instance or as a tuple of (X, y) Numpy arrays. If ‘return_leak_locations’ is True, the leak locations are included as an instance of scipy.sparse.bsr_array as well.

Return type:

ScadaData or list[tuple[numpy.ndarray, numpy.ndarray]]

epyt_flow.data.benchmarks.battledim.load_scenario(return_test_scenario: bool, download_dir: str | None = None, verbose: bool = True) → ScenarioConfig[source]

Creates and returns the BattLeDIM scenario – it can be either modified or passed directly to the simulator ScenarioSimulator.

Note

Note that due to randomness, the simulation results differ from the original data set which can be loaded by calling load_data().

Parameters:

return_test_scenario (bool) – If True, the evaluation/test scenario is returned, otherwise the historical (i.e. training) scenario is returned.
download_dir (str, optional) –
Path to the L-TOWN.inp file – if None, the temp folder will be used. If the path does not exist, the .inp file will be downloaded to the given path.

The default is None.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

Complete scenario configuration of the BattLeDIM benchmark scenario.

Return type:

ScenarioConfig

epyt_flow.data.benchmarks.batadal

The BATtle of the Attack Detection ALgorithms (BATADAL) by Riccardo Taormina, Stefano Galelli, Nils Ole Tippenhauer, Avi Ostfeld, Elad Salomons, Demetrios Eliades is a competition on planning and management of water networks undertaken within the Water Distribution Systems Analysis Symposium. The goal of the battle was to compare the performance of algorithms for the detection of cyber-physical attacks, whose frequency has increased in the last few years along with the adoption of smart water technologies. The design challenge was set for the C-Town network, a real-world, medium-sized water distribution system operated through programmable logic controllers and a supervisory control and data acquisition (SCADA) system. Participants were provided with data sets containing (simulated) SCADA observations, and challenged to design an attack detection algorithm. The effectiveness of all submitted algorithms was evaluated in terms of time-to-detection and classification accuracy. Seven teams participated in the battle and proposed a variety of successful approaches leveraging data analysis, model-based detection mechanisms, and rule checking. Results were presented at the Water Distribution Systems Analysis Symposium (World Environmental and Water Resources Congress) in Sacramento, California on May 21-25, 2017. The paper summarizes the BATADAL problem, proposed algorithms, results, and future research directions.

See https://www.batadal.net/ for details.

This module provides a function for loading the original BATADAL data set (load_data()), as well as functions for loading the scenario (load_scenario()) and the pre-generated SCADA data (load_scada_data()).

epyt_flow.data.benchmarks.batadal.load_data(download_dir: str | None = None, return_X_y: bool = False, return_ground_truth: bool = False, return_features_desc: bool = False, verbose: bool = True) → dict[source]

Loads the original BATADAL competition data.

Parameters:

download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data together with the labels is returned as pairs of Numpy arrays. Otherwise, the data is returned as Pandas data frames.

The default is False.
return_ground_truth (bool) –
If True and if return_X_y is True, the ground truth labels are included in the returned dictionary – note that the labels provided in the benchmark constitute a partial labeling only.

The default is False.
return_features_desc (bool) –
If True and if return_X_y is True, feature names (i.e. descriptions) are included in the returned dictionary.

The default is False.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

Dictionary of the loaded benchmark data. The dictionary contains the two training data sets (“train_1” and “train_2”), as well as the test data set (“test”). If return_X_y is False, each dictionary entry is a Pandas dataframe. Otherwise, it is a tuple of sensor readings and labels (except for the test set) – if return_ground_truth is True or return_features_desc is True, the corresponding data is appended to the tuple.

Return type:

dict

epyt_flow.data.benchmarks.batadal.load_scada_data(download_dir: str | None = None, return_X_y: bool = False, return_ground_truth: bool = False, return_features_desc: bool = False, verbose: bool = True) → Any[source]

Loads the SCADA data of the simulated BATADAL benchmark scenario – note that due to randomness and undocumented aspects of the original BATADAL data set, these differ from the original data set which can be loaded by calling load_data().

Parameters:

download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data together with the labels is returned as pairs of Numpy arrays. Otherwise, the data is returned as Pandas data frames.

The default is False.
return_ground_truth (bool) –
If True and if return_X_y is True, the ground truth labels are included in the returned dictionary – note that the labels provided in the benchmark constitute a partial labeling only.

The default is False.
return_features_desc (bool) –
If True and if return_X_y is True, feature names (i.e. descriptions) are included in the returned dictionary.

The default is False.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

epyt_flow.data.benchmarks.batadal.load_scenario(download_dir: str | None = None, verbose: bool = True) → ScenarioConfig[source]

Creates and returns the BATADAL scenario – it can be either modified or directly passed to the simulator ScenarioSimulator.

Note

Note that due to randomness and undocumented aspects of the original BATADAL benchmark, the scenario simulation results differ from the original data set which can be loaded by calling load_data().

Parameters:

download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

The BATADAL scenario.

Return type:

ScenarioConfig

epyt_flow.data.benchmarks.water_usage

Module provides a function for loading the water usage data set by P. Pavlou et al.

epyt_flow.data.benchmarks.water_usage.compute_evaluation_score(y_pred: numpy.ndarray, y: numpy.ndarray) → dict[source]

Evaluates the performance of a detection method.

Note that instead of a single metric, the following set of metrics is used:

Accuracy
Precision
F1-score (using “micro” averaging)
Cohen’s kappa
ROC AUC

Parameters:

y_pred (numpy.ndarray) – Event indication prediction over time.
y (numpy.ndarray) – Ground truth event indication over time.

Returns:

All evaluation scores.

Return type:

dict
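To make two of the listed metrics concrete, here is a minimal NumPy sketch of accuracy and Cohen's kappa computed from their textbook definitions. It is not the library's implementation. The function names and the treatment of the degenerate single-class case are assumptions.

```python
import numpy as np


def accuracy_sketch(y_pred, y) -> float:
    """Fraction of time steps where prediction and ground truth agree."""
    return float(np.mean(np.asarray(y_pred) == np.asarray(y)))


def cohen_kappa_sketch(y_pred, y) -> float:
    """Agreement between y_pred and y, corrected for chance agreement."""
    y_pred, y = np.asarray(y_pred), np.asarray(y)
    labels = np.union1d(y_pred, y)

    p_observed = np.mean(y_pred == y)
    # Chance agreement: both pick class c independently with their own frequency.
    p_expected = sum(np.mean(y_pred == c) * np.mean(y == c) for c in labels)

    if p_expected == 1.0:
        # Assumption: kappa is treated as undefined if only one class occurs.
        return float("nan")
    return float((p_observed - p_expected) / (1.0 - p_expected))


y = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
acc = accuracy_sketch(y_pred, y)
kappa = cohen_kappa_sketch(y_pred, y)
```

If each time step carries exactly one class label, the micro-averaged F1-score coincides with the accuracy, so the two values would be identical in that setting.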

epyt_flow.data.benchmarks.water_usage.load_water_usage(download_dir: str | None = None, return_X_y: bool = True, verbose: bool = True) → dict[source]

“Monitoring domestic water consumption: A comparative study of model-based and data-driven end-use disaggregation methods” by P. Pavlou, S. Filippou, S. Solonos, S. G. Vrachimis, K. Malialis, D. G. Eliades, T. Theocharides, M. M. Polycarpou is a benchmark concerning the monitoring of water usage of different household appliances. Informing consumers about their water usage has been shown to have an impact on their behavior toward drinking water conservation. The data were created using the STochastic Residential water End-use Model (STREaM) (Cominola et al., 2018), a modelling software that generates synthetic time series data of a household.

This benchmark data set is for identifying active appliances from the aggregated water consumption – i.e. a multi-class classification problem. The data set considers the use of a standard toilet, standard shower, standard faucet, high efficiency clothes washer, and standard dishwasher in a 2-person household for a period of 180 days (6 months) at a resolution of 10 s. The data set is already split into 3 subsets for training (90 days), validation (45 days), and testing (45 days).

For more information see https://github.com/KIOS-Research/Water-Usage-Dataset/

Note

Note that although this data set is synthetic, only the final data set is provided.

Parameters:

download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data is returned together with the multi-class labels as two Numpy arrays, otherwise, the data is returned as a Pandas data frame.

The default is True.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

The data set as a dictionary with entries “train”, “validation”, and “test” containing the respective data.

Return type:

dict

epyt_flow.data.benchmarks.gecco_water_quality

Module provides functions for loading different GECCO water quality data sets.

GECCO Water Quality 2017	`load_gecco2017_water_quality_data()`
GECCO Water Quality 2018	`load_gecco2018_water_quality_data()`
GECCO Water Quality 2019	`load_gecco2019_water_quality_data()`

Note that the scoring/evaluation algorithm is the same for all GECCO water quality benchmarks and is implemented in compute_evaluation_score().

epyt_flow.data.benchmarks.gecco_water_quality.compute_evaluation_score(y_pred: numpy.ndarray, y: numpy.ndarray) → float[source]

Evaluates the performance of a detection method.

Note

All GECCO water quality challenges use the F1-score for evaluation.

Parameters:

y_pred (numpy.ndarray) – Event indication prediction over time.
y (numpy.ndarray) – Ground truth event indication over time.

Returns:

Evaluation score.

Return type:

float
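As a reminder of what the score measures, the binary F1-score is 2·TP / (2·TP + FP + FN). The following is a minimal NumPy sketch of that formula, not the library's implementation. The function name and the value returned when there are no positives at all are assumptions.

```python
import numpy as np


def f1_score_sketch(y_pred, y) -> float:
    """Binary F1-score: harmonic mean of precision and recall."""
    y_pred = np.asarray(y_pred).astype(bool)
    y = np.asarray(y).astype(bool)

    tp = np.sum(y_pred & y)    # events correctly flagged
    fp = np.sum(y_pred & ~y)   # false alarms
    fn = np.sum(~y_pred & y)   # missed events

    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # Assumption: return 0.0 if neither predictions nor labels contain events.
        return 0.0
    return float(2 * tp / denominator)


y = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
score = f1_score_sketch(y_pred, y)  # precision 1.0, recall 2/3
```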

epyt_flow.data.benchmarks.gecco_water_quality.load_gecco2017_water_quality_data(download_dir: str | None = None, return_X_y: bool = True, verbose: bool = True) → pandas.DataFrame | tuple[numpy.ndarray, numpy.ndarray][source]

GECCO Industrial Challenge 2017 Dataset: A water quality dataset for the “Monitoring of drinking-water quality” competition organized by M. Friese, J. Stork, A. Fischbach, M. Rebolledo, T. Bartz-Beielstein at the Genetic and Evolutionary Computation Conference 2017, Berlin, Germany.

This is a benchmark for anomaly detection algorithms on water quality. The data is provided by the “Thüringer Fernwasserversorgung” (Germany) and constitutes a real-world data set. In this data set, 9 numeric water quality features are given at a sampling rate of 1 min over approx. 3 months. The goal is to predict the presence of an anomaly – i.e. binary classification.

More information can be found at https://zenodo.org/records/3884465 and http://www.spotseven.de/gecco-challenge/gecco-challenge-2017/

Note

Note that this is NOT a simulated scenario and therefore only the final data set is provided.

Parameters:

download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data is returned together with the labels as two Numpy arrays, otherwise the data is returned as a Pandas data frame.

The default is True.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

The benchmark data set as either a Pandas data frame or as a pair of (X, y) Numpy arrays.

Return type:

pandas.DataFrame or tuple[numpy.ndarray, numpy.ndarray]

epyt_flow.data.benchmarks.gecco_water_quality.load_gecco2018_water_quality_data(download_dir: str | None = None, return_X_y: bool = True, verbose: bool = True) → pandas.DataFrame | tuple[numpy.ndarray, numpy.ndarray][source]

GECCO Industrial Challenge 2018 Dataset: A water quality dataset for the “Internet of Things: Online Anomaly Detection for Drinking Water Quality” competition organized by F. Rehbach, M. Rebolledo, S. Moritz, S. Chandrasekaran, T. Bartz-Beielstein at the Genetic and Evolutionary Computation Conference 2018, Kyoto, Japan.

This is a benchmark (based on load_gecco2017_water_quality_data()) for anomaly detection algorithms on water quality. The data is provided by the “Thüringer Fernwasserversorgung” (Germany) and constitutes a real-world data set. In this data set, 9 numeric water quality features are given at a sampling rate of 1 min over approx. 3 months. The goal is to predict the presence of an anomaly – i.e. binary classification.

More information can be found at https://zenodo.org/records/3884398 and http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2018/

Note

Note that this is NOT a simulated scenario and therefore only the final data set is provided.

Parameters:

download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data is returned together with the labels as two Numpy arrays, otherwise the data is returned as a Pandas data frame.

The default is True.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

The benchmark data set as either a Pandas data frame or as a pair of (X, y) Numpy arrays.

Return type:

pandas.DataFrame or tuple[numpy.ndarray, numpy.ndarray]

epyt_flow.data.benchmarks.gecco_water_quality.load_gecco2019_water_quality_data(download_dir: str | None = None, return_X_y: bool = True, verbose: bool = True) → dict[source]

GECCO Industrial Challenge 2019 Dataset: A water quality dataset for the “Internet of Things: Online Event Detection for Drinking Water Quality Control” competition organized by F. Rehbach, S. Moritz, T. Bartz-Beielstein at the Genetic and Evolutionary Computation Conference 2019, Prague, Czech Republic.

This is a benchmark (based on load_gecco2018_water_quality_data()) for anomaly detection algorithms on water quality. The data is provided by the “Thüringer Fernwasserversorgung” (Germany) and constitutes a real-world data set. In this data set, 6 numeric water quality features are given at a sampling rate of 1 min over approx. 3 months. The goal is to predict the presence of an anomaly – i.e. binary classification. The data set itself comes in three splits: a train set, a validation set, and a test set.

More information can be found at https://zenodo.org/records/4304080 and https://www.th-koeln.de/informatik-und-ingenieurwissenschaften/gecco-challenge-2019_63244.php

Note

Note that this is NOT a simulated scenario and therefore only the final data set is provided.

Parameters:

download_dir (str, optional) –
Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

The default is None.
return_X_y (bool, optional) –
If True, the data is returned together with the labels as two Numpy arrays, otherwise the data is returned as a Pandas data frame.

The default is True.
verbose (bool, optional) –
If True, a progress bar is shown while downloading files.

The default is True.

Returns:

The data set as a dictionary with entries “train”, “validation”, and “test” containing the respective data.

Return type:

dict