Configuration Files - Handling ClustEval

Above we already discussed that the framework needs program configurations, dataset configurations and goldstandard configurations. These configuration files directly reference the corresponding files (dataset, goldstandard, …) on the filesystem. Internally, the framework uses several abstraction layers to store the configurations. Figure 5 shows the overall structure of the configuration files used in the backend. Dataset and goldstandard configurations are linked together in a data configuration.

A run is an abstract entity that can be performed by the backend. In most cases, its execution involves applying clustering methods to several datasets and afterwards assessing clustering qualities using the goldstandard corresponding to each dataset. A run corresponds to a run configuration file, which in turn references the program and data configurations that should be combined pairwise.

When a run is performed by the backend, the clustering methods wrapped by all referenced program configurations are applied to all datasets indirectly referenced through the data configurations.

Data Configurations

A data configuration is a file that combines two other configurations: a dataset configuration and a goldstandard configuration. If you create a run in which you want to apply two clustering methods to three datasets (together with their goldstandards), you do so by listing the names of the three corresponding data configurations in the run configuration. Please note: The data configuration file has to have the file extension .dataconfig, otherwise it will not be recognized by the framework.

See DataConfigParser.parseFromFile(File) for options that are respected when the file is parsed.

Example

A data configuration could look as follows:

datasetConfig = astral_1
goldstandardConfig = astral_1_161
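How such a file might be read can be sketched in a few lines of Python. This is only an illustration and not the framework's DataConfigParser: the function names parse_options and load_data_config are made up for this sketch, and treating both options as required is an assumption.

```python
from pathlib import Path


def parse_options(text):
    """Turn 'key = value' lines into a dict, ignoring lines without '='."""
    options = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def load_data_config(path):
    """Read a data configuration; reject files without the .dataconfig extension."""
    path = Path(path)
    if path.suffix != ".dataconfig":
        raise ValueError(f"{path.name} does not end in .dataconfig")
    options = parse_options(path.read_text())
    # Assumption of this sketch: both references are required.
    for key in ("datasetConfig", "goldstandardConfig"):
        if key not in options:
            raise ValueError(f"{path.name} is missing the option {key}")
    return options
```

Calling load_data_config on a file with the two lines shown above would return a dict with the keys datasetConfig and goldstandardConfig; a file with any other extension is rejected before it is read.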

Dataset Configuration

A dataset configuration gives the framework meta information about the corresponding dataset: the internal name of the dataset, its filename and its format.

datasetName: This name is used to find and access datasets within the framework. It has to be identical to the name of the subfolder that contains the corresponding dataset.

See DataSetConfigParser.parseFromFile(File) for options that are respected when the file is parsed.

Example

A dataset configuration could look as follows:

datasetName = astral_1_161
datasetFile = blastResults.txt
datasetFormat = BLASTDataSetFormat
distanceMeasureAbsoluteToRelative = EuclidianDistanceMeasure
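The rule that datasetName has to match the dataset's subfolder can be checked mechanically. The following Python sketch assumes that the dataset file lies directly inside that subfolder; the function name and the path used in the usage lines are hypothetical and not taken from the framework.

```python
from pathlib import Path


def matches_dataset_folder(options, dataset_file_path):
    """True if the file is named datasetFile and its folder is named datasetName."""
    dataset_file_path = Path(dataset_file_path)
    return (
        dataset_file_path.name == options["datasetFile"]
        and dataset_file_path.parent.name == options["datasetName"]
    )


options = {"datasetName": "astral_1_161", "datasetFile": "blastResults.txt"}
# Hypothetical location; only the last two path components matter here.
print(matches_dataset_folder(options, "datasets/astral_1_161/blastResults.txt"))
```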

GoldStandard Configuration

A goldstandard configuration gives the framework meta information about the corresponding goldstandard: the internal name of the goldstandard and its filename.

See GoldStandardConfigParser.parseFromFile(File) for options that are respected when the file is parsed.

Example

A goldstandard configuration could look as follows:

goldstandardName = astral_1_161
goldstandardFile = astral_161.goldstandard_3.txt

Program Configuration

For every clustering method there can be several configuration files. All program configurations have to be located in <REPOSITORY ROOT>/programs/configs. A program configuration tells the framework which parameters the program expects, how to invoke the executable, which parameter values to invoke it with, and several other pieces of information. Possible entries in a program configuration follow.

Please note: The program configuration file has to have the file extension .config, otherwise it will not be recognized by the framework.

See ProgramConfigParser.parseFromFile(File) for options that are respected when the file is parsed.

Example

A program configuration could look as follows:

program = APcluster
parameters = preference,maxits,convits,dampfact
optimizationParameters = preference,maxits,convits,dampfact
executable = apcluster
compatibleDataSetFormats = APRowSimDataSetFormat
outputFormat = APRunResultFormat

[invocationFormat]
invocationFormat = %e %i %preference %o maxits=%maxits convits=%convits dampfact=%dampfact

[maxits]
desc = Max iterations
type = 1
def = 2000
minValue = 2000
maxValue = 5000

[convits]
desc = Cluster Center duration
type = 1
def = 200
minValue = 200
maxValue = 500

[dampfact]
type = 2
def = 0.9
minValue = 0.7
maxValue = 0.99

[preference]
desc = Preference
type = 2
def = 0.5
minValue = 0.0
maxValue = 1.0
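To illustrate how the invocation format and the parameter defaults could interact, here is a Python sketch that fills the placeholders of a shortened version of the configuration above. It is not the framework's ProgramConfigParser. It assumes that %e stands for the executable, %i for the input file and %o for the output file, that the invocation format is written on a single line, and that the entries before the first section can be treated as one unnamed section; the function names are made up.

```python
import configparser
import re

PROGRAM_CONFIG = """\
program = APcluster
parameters = preference,maxits,convits,dampfact
executable = apcluster

[invocationFormat]
invocationFormat = %e %i %preference %o maxits=%maxits convits=%convits dampfact=%dampfact

[maxits]
def = 2000

[convits]
def = 200

[dampfact]
def = 0.9

[preference]
def = 0.5
"""


def load_program_config(text):
    # interpolation=None keeps configparser from interpreting the % placeholders.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    # The file has entries before its first section; give them a name.
    parser.read_string("[general]\n" + text)
    return parser


def build_invocation(parser, input_path, output_path, overrides=None):
    """Fill the invocation format with defaults, overrides and file paths."""
    general = parser["general"]
    values = {name: parser[name]["def"] for name in general["parameters"].split(",")}
    values.update(overrides or {})
    # Assumed meaning of the single-letter placeholders.
    values.update({"e": general["executable"], "i": input_path, "o": output_path})
    template = parser["invocationFormat"]["invocationFormat"]
    return re.sub(r"%(\w+)", lambda match: str(values[match.group(1)]), template)


parser = load_program_config(PROGRAM_CONFIG)
print(build_invocation(parser, "in.txt", "out.txt"))
# prints: apcluster in.txt 0.5 out.txt maxits=2000 convits=200 dampfact=0.9
```

Matching whole placeholder names with a regular expression avoids the problem that a naive replacement of %i or %e could hit the beginning of a longer placeholder.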

Runs

Runs are entities that can be performed by the backend server. A run is defined by a file in the folder <REPOSITORY ROOT>/runs. The name of that file (without extension) also defines the name of the run. Depending on the type of the run, this file contains several other components that configure the process when the run is performed. Figure 6 shows the different types of runs and how they relate to each other.

Every run is defined in a run-file in the corresponding folder of the repository. Depending on the type of the run, different options are available that can be specified in the run-file. Common to all types of run files are the following options:

See RunParser.parseFromFile(File) for options that are respected when the file is parsed.

Execution Runs

Execution runs calculate clusterings during their execution and assess the quality of each of those clusterings. Clusterings are calculated by applying clustering methods to datasets using a certain parameter set. That is why execution runs have sets of both program and data configurations. At execution time every program configuration is applied to every data configuration. For every calculated clustering a set of clustering quality measures is assessed.

In general, the options for such a combination of data and program configuration are taken from the respective configurations, but they can be overridden by options in the run configuration. That means parameter values defined in both the program configuration and the run configuration are taken from the latter.
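The precedence rule amounts to a simple merge in which the run configuration wins. A minimal Python sketch, with a made-up function name and illustrative values:

```python
def effective_parameters(program_values, run_values):
    """Combine parameter values; entries from the run configuration win."""
    merged = dict(program_values)
    merged.update(run_values)
    return merged


program_values = {"preference": "0.5", "dampfact": "0.9"}
run_values = {"dampfact": "0.8"}  # illustrative override in the run configuration
print(effective_parameters(program_values, run_values))
# prints: {'preference': '0.5', 'dampfact': '0.8'}
```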

For execution runs, in addition to the options defined for all runs (see above), the following options for the run-file are defined:

See ExecutionRunParser.parseFromFile(File) for options that are respected when the file is parsed.

Clustering Runs

Clustering runs are a type of execution run; that is, they calculate clusterings by applying every program configuration to every data configuration. Afterwards they assess the qualities of those clusterings in terms of several clustering quality measures.

In the case of clustering runs, exactly one clustering is calculated and assessed for every pair of program and data configuration. Clustering runs are visualized in figure 7.
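The pairing of configurations is a Cartesian product. The following Python sketch enumerates the pairs for two program configurations and three data configurations; the function name and the data configuration names are illustrative only.

```python
from itertools import product


def clustering_jobs(program_configs, data_configs):
    """Pair every program configuration with every data configuration."""
    return list(product(program_configs, data_configs))


jobs = clustering_jobs(
    ["APcluster_1", "MCL_1"],
    ["data_a", "data_b", "data_c"],  # hypothetical data configuration names
)
print(len(jobs))  # 2 programs x 3 data configurations = 6 clusterings
```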

For clustering runs, the options are the same as for all execution runs (see Execution Run Files).

Parameter Optimization Runs

Parameter optimization runs are a type of execution run; that is, they calculate clusterings by applying every program configuration to every data configuration. Afterwards they assess the qualities of those clusterings in terms of several clustering quality measures.

In contrast to clustering runs, parameter optimization runs calculate several clusterings for every pair of data and program configuration. Every clustering corresponds to a certain parameter set, and the parameter sets to evaluate are determined by a parameter optimization method (see 4.8 for more information). Parameter optimization runs are visualized in figure 8.
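The general loop can be sketched in Python as follows. The evenly spaced grid used here is only a stand-in and does not reproduce any of the framework's parameter optimization methods; the quality function, the function names and the value ranges are likewise assumptions for illustration.

```python
from itertools import product


def grid_parameter_sets(ranges, steps):
    """Stand-in optimization method: evenly spaced values per parameter."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    axes = []
    for name, (low, high) in ranges.items():
        width = (high - low) / (steps - 1)
        axes.append([(name, low + k * width) for k in range(steps)])
    return [dict(combination) for combination in product(*axes)]


def optimize(pairs, ranges, steps, quality):
    """For every (program, data) pair, keep the best-scoring parameter set."""
    best = {}
    for pair in pairs:
        scored = [
            (quality(pair, params), params)
            for params in grid_parameter_sets(ranges, steps)
        ]
        # Compare by score only; the parameter dicts are not ordered.
        best[pair] = max(scored, key=lambda item: item[0])
    return best


def toy_quality(pair, params):
    # Placeholder for a clustering quality measure; peaks at preference = 0.5.
    return -abs(params["preference"] - 0.5)


pairs = [("APcluster_1", "astral_1_171")]
result = optimize(pairs, {"preference": (0.0, 1.0)}, steps=3, quality=toy_quality)
print(result[pairs[0]][1])
```

In a real run, the quality function would execute the clustering method with the given parameter set and evaluate the optimization criterion on the result.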

For parameter optimization runs, in addition to the options defined for all execution runs (see Execution Run Files), the following options for the run-file are defined:

See ParameterOptimizationRunParser.parseFromFile(File) for options that are respected when the file is parsed.

Example

A parameter optimization run could look as follows:

programConfig = APcluster_1,TransClust_2,MCL_1
dataConfig = astral_1_171
qualityMeasures = TransClustF2ClusteringQualityMeasure,SilhouetteValueRClusteringQualityMeasure
mode = parameter_optimization
optimizationMethod = DivisiveParameterOptimizationMethod
optimizationCriterion = SilhouetteValueRClusteringQualityMeasure
optimizationIterations = 1001

[TransClust_2]
optimizationParameters = T

[MCL_1]
optimizationParameters = I

[APcluster_1]
optimizationParameters = preference,dampfact,maxits,convits
optimizationMethod = APDivisiveParameterOptimizationMethod
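The example suggests that a section named after a program configuration can set options for that program only. The Python sketch below reads a shortened version of the run file and resolves the optimization method and parameters per program. That a program-specific optimizationMethod takes precedence over the run-wide one is an interpretation of the example, not something confirmed by the parser documentation, and the function names are made up.

```python
import configparser

RUN_FILE = """\
programConfig = APcluster_1,TransClust_2,MCL_1
dataConfig = astral_1_171
optimizationMethod = DivisiveParameterOptimizationMethod

[TransClust_2]
optimizationParameters = T

[MCL_1]
optimizationParameters = I

[APcluster_1]
optimizationParameters = preference,dampfact,maxits,convits
optimizationMethod = APDivisiveParameterOptimizationMethod
"""


def load_run(text):
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    # Entries before the first section are collected under a made-up name.
    parser.read_string("[run]\n" + text)
    return parser


def optimization_settings(parser):
    """Resolve optimization method and parameters for each program configuration."""
    run = parser["run"]
    settings = {}
    for program in run["programConfig"].split(","):
        section = dict(parser[program]) if parser.has_section(program) else {}
        parameters = section.get("optimizationParameters", "")
        settings[program] = {
            # Assumed precedence: program section first, then run-wide value.
            "method": section.get("optimizationMethod", run["optimizationMethod"]),
            "parameters": [name for name in parameters.split(",") if name],
        }
    return settings


for program, resolved in optimization_settings(load_run(RUN_FILE)).items():
    print(program, resolved["method"], resolved["parameters"])
```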

Analysis Runs

Analysis runs assess certain properties of objects of interest. An analysis run has a set of target objects and a set of statistics that should be assessed for each of the target objects. That means that at execution time every statistic is assessed for every target object.
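As a rough Python sketch of this scheme, statistics can be modelled as functions that are applied to every target object. The targets and statistics used here are toy placeholders and not statistics provided by the framework.

```python
def run_analysis(targets, statistics):
    """Assess every statistic for every target object."""
    return {
        (target_name, statistic_name): statistic(target)
        for target_name, target in targets.items()
        for statistic_name, statistic in statistics.items()
    }


# Toy targets: lists of numbers standing in for real target objects.
targets = {"target_a": [3, 1, 2], "target_b": [10, 20]}
statistics = {"size": len, "maximum": max}
results = run_analysis(targets, statistics)
print(results[("target_a", "maximum")])  # 3
```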

Data Analysis Runs

In the case of data analysis runs, the target objects to analyze are data configurations (indirectly datasets) and the statistics are data statistics, that is, properties of datasets. Data analysis runs are visualized in figure 9.

For data analysis runs the following options for the run-file are defined:

See DataAnalysisRunParser.parseFromFile(File) for options that are respected when the file is parsed.

Run Analysis Runs

In the case of run analysis runs, the target objects to analyze are clusterings (results of execution runs) and the statistics are run statistics, that is, properties of execution run results. Run analysis runs are visualized in figure 10.

For run analysis runs the following options for the run-file are defined:

See RunAnalysisRunParser.parseFromFile(File) for options that are respected when the file is parsed.

Run-Data Analysis Runs

In the case of run-data analysis runs, the target objects to analyze are pairs of data configurations and clusterings (results of execution runs) and the statistics are run-data statistics, that is, relationships between execution run results and properties of data configurations. Run-data analysis runs are visualized in figure 11.

For run-data analysis runs the following options for the run-file are defined:

See RunDataAnalysisRunParser.parseFromFile(File) for options that are respected when the file is parsed.

Robustness Analysis Run

A robustness analysis run can be used to measure how the clustering performance of methods on data sets changes when the data are distorted. The run first parses the best performances and the corresponding parameter sets of clustering methods on data sets from a set of run results. Next, the original data configurations are distorted, and the clustering methods are executed on the distorted versions with the same parameter sets that originally led to the best performances.

The randomizers that distort the data are parameterized, i.e. they have parameters that lead to different distorted data configurations. One robustness run generates a certain number of distorted data configurations for each randomizer parameter set. The robustness analysis run file provides the following options:

See RobustnessAnalysisRunParser.parseFromFile(File) for options that are respected when the file is parsed.
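To give a feeling for how many distorted data configurations a robustness run produces, the following Python sketch enumerates them. It assumes that the stated number of distorted configurations is generated per original data configuration and per randomizer parameter set; the naming scheme, the parameter name and the function name are hypothetical.

```python
def distorted_config_names(data_configs, randomizer_param_sets, count_per_set):
    """List one name per distorted data configuration that would be generated."""
    names = []
    for data_config in data_configs:
        for set_index, _params in enumerate(randomizer_param_sets):
            for copy_index in range(count_per_set):
                # Hypothetical naming scheme, for illustration only.
                names.append(f"{data_config}_set{set_index}_copy{copy_index}")
    return names


param_sets = [{"strength": 0.1}, {"strength": 0.2}, {"strength": 0.4}]  # made-up parameter
names = distorted_config_names(["astral_1", "astral_1_171"], param_sets, count_per_set=4)
print(len(names))  # 2 data configurations x 3 parameter sets x 4 copies = 24
```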