Database

The SQL database of the frontend stores a subset of the data contained in the repository of the backend. The stored information can then be retrieved and visualized by the website.

Supported SQL derivates

Currently, the ClustEval backend supports MySQL and postgreSQL. The website uses postgreSQL by default because of its support for materialized views. See Repository Configuration for an explanation how to configure a database for a repository.

Tables

In the following we will give a short description of every table of the database.

  • Hint 1 : All tables that correspond to and are responsible for storing repository objects have a foreign key to the repository table. Thus for every repository object it is known, to which repository it belongs. This is not mentioned in the following descriptions.
  • Hint 2 : The abbreviation FK means Foreign Key.
  • Hint 3 : Column names are denoted in italic.
  • Hint 4 : When a run is performed, certain files are copied into a new result folder. This includes datasets, goldstandards and all configuration files. These files in the result folder are mapped to their original files they correspond to in the original repository. These relationships are stored in the database in a separate column which is named after the table name plus the postfix ” id”. For example datasets store this relationship in the column “dataset id”.
cluster_objects
Every clustering contains clusters, which again contain cluster objects. This table stores all cluster objects with their name and knows, to which cluster they belong (cluster id, FK).
clustering_quality_measure_optimums
For visualization and interpretation of results the website needs to know, whether a certain clustering quality measure is optimal when min- or maximized. This table maps every measure to ‘Minimum’ or ‘Maximum’ (name).
clustering_quality_measures
This table keeps track of all the clustering quality measures available in the framework. For every measure it stores its name, minimal and maximal value (minValue and maxValue) and whether the measure requires a goldstandard (requiresGoldStandard ). On the website every clustering quality measure has a readable alias. This alias is stored in the column (alias).
clusterings
This table holds all the clusterings that were calculated and are stored in the repository. Every parsed clustering corresponds to a file in the repository, for which we store its absolute path (absPath).
clusters
Every clustering has clusters. This table holds all clusters and maps them to their corresponding clustering. Every cluster has a name.
data_configs
A table holding all data configurations. Every data configuration has an absolute path (absPath), a name, a corresponding dataset configuration (dataset config id, FK) and a goldstandard configuration (goldstandard config id, FK).
dataset_configs
A table holding all dataset configurations. Every dataset configuration has an absolute path (absPath), a name and a corresponding dataset (dataset id, FK).
dataset_descriptions
Holds descriptions of each data set. Is seeded by the seeds.db file of the Rails app.
dataset_formats
A table holding all dataset formats. Every format has a name and an alias.
dataset_images
Contains the name of the image correspond to a data set, if it has one.
dataset_publications
Contains the references of data sets, if available.
dataset_types
This table holds all types of datasets. For every dataset type it stores a name. On the website every dataset type has a readable alias. For every dataset type this table stores the alias.
dataset_visibilities
Determines whether a data set is visible on the website (by default).
datasets
This table holds all datasets. For every dataset its absolute path (absPath), checksum, format (dataset format id, FK) and type (dataset type id, FK) is stored.
goldstandard configs
This table holds all goldstandard configurations. For every goldstandard configuration its absolute path (absPath), name and corresponding goldstandard (goldstandard id, FK) is stored.
goldstandards
This table holds all goldstandards. For every goldstandard its absolute path (absPath), and the corresponding goldstandard (goldstandard id, FK) is stored.
optimizable_program_parameters
This table stores the program parameters (program parameter id, FK) of a program configuration (program config id, FK), that can be optimized.
parameter_optimization_methods
This table stores all available parameter optimization methods (name) registered in a repository (repository id, FK).
program_configs
All program configurations registered in a repository are stored in this table. Every program configuration has a name, an absolute path (absPath), different invocation formats for different scenarios (invocationFormat, invocationFormatWithoutGoldStandard, invocationFormatParameterOptimization, invocationFormatParameterOptimizationWithoutGoldStandard ) and a boolean whether the program expects input with normalized similiarites (expectsNormalizedDataSet). For every program configuration we store the corresponding repository (repository id, FK), the program (program id, FK) this configuration belongs to, the run result format (run result format id, FK) of the program using this configuration.
program_configs_compatible_dataset_formats
Every program configuration (program config id, FK) has a set of compatible dataset formats (dataset format id, FK), which the program will understand when it is executed using this configuration.
program_parameter_types
This table contains the names (name) of the different possible program parameter types (see 4.9.7 for more information on the types of parameters supported by clusteval ).
program_parameters
This table stores the program parameters defined in a program configuration (program config id, FK). Every program parameter has a type (program parameter type id, FK), a name, an (optional) description, a minValue, a maxValue and a default value (def ).
programs
This tables stores all clustering methods together with their absPath and an alias, which is used to represent this clustering method on the website.
repositories
The repositories that are using this database to store their results. Every repository has a absolute base directory (basePath) and a type (repository type id, FK).
repository_types
The types that repositories can have. Every type has a name. Check out 4.1 for more information on which repository types exist.
run_analyses
This table holds all analysis runs. Every analysis run is also a run (run id, FK).
run_analysis_statistics
Every analysis run (run analysis id, FK) evaluates certain statistics (statistic id, FK).
run_clusterings
This table holds all clustering runs. Every clustering run is also a execution run (run execution id, FK).
run_data_analyses
This table holds all data analysis runs. Every data analysis run is also an analysis run (run analysis id, FK).
run_data_analysis_data_configs
Every data analysis run analysis a set of data configurations wrapping datasets. This table holds the data configurations (data config id, FK) that a certain data analysis run (run data analysis id, FK) analyses.
run_execution_data_configs
An execution run applies program configurations to data configurations. This table stores the data configurations (data config id, FK) belonging to execution runs (run execution id, FK).
run_execution_parameter_values
An execution run can specify values for program parameters. This table stores for every execution run (run execution id, FK), program configuration (program config id, FK) and program parameter (program parameter id, FK) the specified value.
run_execution_program_configs
An execution run applies program configurations to data configurations. This table stores the program configurations (program config id, FK) belonging to execution runs (run execution id, FK).
run_execution_quality_measures
An execution run applies clustering methods to datasets and assesses clustering quality measures. This table stores the execution run (run execution id, FK) together with the clustering quality measures (clustering quality measure id, FK) to assess.
run_executions
This table holds all execution runs. Every execution run is also a run (run id, FK).
run_internal_parameter_optimizations
This table holds all internal parameter optimization runs. Every such run is also an execution run (run execution id, FK).
run_parameter_optimization_methods
A parameter optimization run uses parameter optimization methods to optimize parameters. For a certain parameter optimization run (run parameter optimization id, FK) for every program configuration (program config id, FK) a different parameter optimization method (parameter optimization method id, FK) and a clustering quality measure to optimize can be specified.
run_parameter_optimization_parameters
A parameter optimization run optimizes parameters of clustering methods. This table stores for a certain run (run parameter optimization id, FK) for every program configuration contained (program config id, FK) the parameters (program parameter id, FK) to optimize.
run_parameter_optimization_quality_measures
A parameter optimization run optimizes parameters by maximizing or minimizing clustering quality measures. This table stores all clustering quality measures (clustering quality measure id, FK) to assess for the calculated clusterings.
run_parameter_optimizations
This table holds all parameter optimization runs. Every parameter optimization run is also an execution run (run execution id, FK).
run_result_formats
This table holds all run result formats. Every run result format has a name.
run_results
When a run is executed, it produces a unique folder in the results directory of the repository. These folders are stored in this table together with the type of the corresponding run (run id, FK) that created this result (run type id, FK), an absolute path to the results folder (absPath), the uniqueRunIdentifier of this run result (which corresponds to the name of the folder) and the date the run result was created.
run_results_analyses
When an analysis run is executed, it produces a unique folder in the results directory of the repository. Every analysis run result is also a run result (run result id, FK).
run_results_clustering_qualities
Holds the qualities for each program config, data config, parameter set of a particular run.
run_results_clusterings
When a clustering run is executed, it produces a unique folder in the results directory of the repository. Every clustering run result is also an execution run result (run results execution id, FK).
run_results_data_analyses
When a data analysis run is executed, it produces a unique folder in the results directory of the repository. Every data analysis run result is also an analysis run result (run results analysis id, FK).
run_results_executions
When an execution run is executed, it produces a unique folder in the results directory of the repository. Every execution run result is also a run result (run result id, FK).
run_results_internal_parameter_optimizations
When an internal parameter optimization run is executed, it produces a unique folder in the results directory of the repository. Every internal parameter optimization run result is also an execution run result (run results execution id, FK).
run_results_parameter_optimizations
When a parameter optimization run is executed, it produces a set of iterative run results, that are all summarized in .complete-files for every pair of program and data configuration (program config id, FK) and (data config id, FK). Every parameter optimization run result is also an execution run result (run results execution id, FK).
run_results_parameter_optimizations_parameter_set_iterations
Every parameter optimization produces clustering results for a set of iterations. In each iteration a different parameter set (run results parameter optimizations parameter set id, FK) is evaluated. This table holds the number of the iteration, together with the parameter set, the produced clustering (clustering id, FK) and the parameter set in a string representation (paramSetAsString).
run_results_parameter_optimizations_parameter_set_parameters
This table holds the program parameters (program parameter id, FK) that belong to parameter sets (run results parameter optimizations parameter set id, FK) contained in a parameter optimization run result (run results parameter optimization id, FK), evaluated by the framework.
run_results_parameter_optimizations_parameter_sets
This table holds the parameter sets evaluated during a parameter optimization run and contained in a parameter optimization run result (run results parameter optimization id, FK).
run_results_parameter_optimizations_parameter_values
This table contains the value of a certain program parameter (run results parameter optimizations parame FK) evaluated in a certain parameter optimization iteration (run results parameter optimizations parame FK).
run_results_parameter_optimizations_qualities
In every iteration of a parameter optimization a parameter set is evaluated and clus- tering qualities are assessed. This table holds the clustering quality measure (cluster quality measure id, FK) together with the assessed quality.
run_results_run_analyses
When a run analysis run is executed, it produces a unique folder in the results directory of the repository. Every run analysis run result is also an analysis run result (run results analysis id, FK).
run_results_run_data_analyses
When a run-data analysis run is executed, it produces a unique folder in the results directory of the repository. Every run-data analysis run result is also an analysis run result (run results analysis id, FK).
run_run_analyses
This table holds all run analysis runs. Every run analysis run is also an analysis run (run analysis id, FK).
run_run_analysis_run_identifiers
Every run analysis run analyses (run run analysis id, FK) a set of run results with certain identifiers (runIdentifier ).
run_run_data_analyses
This table holds all run-data analysis runs. Every run-data analysis run is also an analysis run (run analysis id, FK).
run_run_data_analysis_data_identifiers
Every run-data analysis run analyzes (run run analysis id, FK) a set of data analysis run results with certain identifiers (dataIdentifier ).
run_run_data_analysis_run_identifiers
Every run-data analysis run analyzes (run run analysis id, FK) a set of execution run results with certain identifiers (runIdentifier ).
run_types
This table holds all types of runs. Every type has a name. Check out 6 for more information on which run types exist.
runs
This table holds all runs. Every run has a type (run type id, FK), an absolute path (absPath), a name and a status.
statistics
This table holds all the statistics registered in a repository. Every statistic has a name and an alias. The alias is used on the website as a readable name.
statistics_data
This table holds all the data statistics. Every data statistic is also a statistic (statistic id, FK).
statistics_runs
This table holds all the run statistics. Every run statistic is also a statistic (statistic id, FK).
statistics_run_data
This table holds all the run-data statistics. Every run-data statistic is also a statistic (statistic id, FK).

Technical Tables

The following tables correspond to rather technical tables, that are just required by the models of the ruby on rails website and do not have a strong meaning with regard to contents.
[aboutus]
A technical table containing the information regarding the ‘About us’ section of the website.
[aboutus_impressums]
A technical table containing the information regarding the ‘Impressum’ section of the website.
[admins]
A technical table containing the information regarding the ‘Admin’ section of the website.
[help_installations]
Models of the help installation section of the website.
[help_publications]
Models of the help publications section of the website.
[help_source_codes]
Models of the help source code section of the website.
[help_technical_documentations]
Models of the help technical documentations section of the website.
[helps]
A technical table containing the information regarding the ‘Help’ section of the website.
[impressions]
A technical table containing the impressions of the website.
[mains]
A technical table containing the information regarding the startpage of the website.
[program descriptions]
This table stores descriptions of clustering methods, for when they are shown on the website.
[program images]
When this table contains an image for a clustering method, it will be shown on the website.
[program publications]
When this table contains publication information for a clustering method, it will be shown on the website.
[schema_migrations]
This table holds all the migrations of the ruby on rails website.
[small_rankings]
ClustEval makes use of reusable modules in its website; one of them is the small ranking cell, which shows highcharts diagrams.
[submit_datasets]
This table corresponds to the section “Submit Dataset” of the website.
[submit_methods]
This table corresponds to the section “Submit Clustering Method” of the website.
[submits]
This table corresponds to the “Submit” section of the website.
[users]
This table holds all the users, that have registered on the website.

Materialized Views

dataset_statistics
Lists all statistics for a particular data set
dataset_recent_statistics
Lists all statistics for a particular data set, but only the most recent version of each statistic.
parameter_optimization_data_configs_iterations
Lists all parameter optimization iterations that have ever been calculated per data configuration.
parameter_optimization_iterations
Lists all parameter optimization iterations.
parameter_optimization_iterations_exts
Additionally joins the above information with the corresponding data set and clustering method.
parameter_optimization_iteration_exts_configs
Joins the information of parameter_optimization_iterations with the corresponding original data configuration and clustering method.
parameter_optimization_iterations_woparam
Holds the same information as parameter_optimization_iterations but without the parameter set.
parameter_optimization_max_qual_rows
Lists those parameter optimization iterations (including all information such as used parameter set) which achieved the highest qualities for a particular quality measure, clustering method and data set.
parameter_optimization_max_quals
Lists the highest achieved qualities for a particular quality measure, clustering method and data set (this is useful for quality measures that are best if maximized).
parameter_optimization_min_quals
Lists the lowest achieved quality values for a particular quality measure, clustering method and data set (this is useful for quality measures that are best if minimized).
run_result_data_analysis_data_configs_statistics
A data analysis run assesses statistics for certain data configurations. This table stores the assessed statistics (statistic id, FK) for every run result (run result id, FK) generated by an analysis run.
run_results_data_configs_rankings
Holds the best performances of all clustering methods and clustering quality measures for one particular data configuration.
run_results_program_configs_rankings
Holds the best performances of one particular program configuration for all clustering quality measures and data configurations.