Datasets for ISCC Evaluation#

Introduction#

Dataset Types#

To test the capabilities of the ISCC, we need a collection of test data that includes ground truth data. Ground truth data serve as a reference standard: they let us measure the ISCC's accuracy in detecting similar or duplicate content. These benchmarks can be established in two ways:

  1. Real-world media file collections annotated with information about near-duplicates within the dataset. These collections offer insights into the ISCC's performance in a real-world setting, albeit contingent on the quality of annotations.
  2. Synthetically transformed media files. We can take a unique media file and apply various modifications to it. This strategy tests the ISCC's resilience against different transformations, although the synthetic changes might not fully mirror real-world variations.
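Given ground-truth clusters of either kind, accuracy can be measured by comparing an algorithm's matched pairs against all pairs implied by the clusters. The following is a minimal sketch of that evaluation; the function name, input shapes, and sample data are illustrative, not part of the TwinSpect API:

```python
from itertools import combinations

def evaluate_matches(clusters, predicted_pairs):
    """Compute precision/recall for near-duplicate detection.

    `clusters`: list of sets of file names that are ground-truth duplicates.
    `predicted_pairs`: set of frozensets (file-name pairs) flagged as matches.
    Every pair of files within a cluster counts as a true match.
    """
    truth = {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}
    tp = len(truth & predicted_pairs)
    precision = tp / len(predicted_pairs) if predicted_pairs else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

# Illustrative data: one 3-file cluster (3 true pairs) and a singleton.
clusters = [{"a.mp3", "a_mod.mp3", "a_rot.mp3"}, {"b.mp3"}]
predicted = {
    frozenset({"a.mp3", "a_mod.mp3"}),     # true match
    frozenset({"a.mp3", "sample1.mp3"}),   # false match against a distractor
}
p, r = evaluate_matches(clusters, predicted)  # p = 0.5, r = 1/3
```

Note how singleton clusters contribute no true pairs but still act as queries that should return no match.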

Data Folders#

To streamline processing and ensure comparability, our datasets adhere to a predefined directory structure and file naming convention. This approach implicitly encodes ground truth and optionally transformation information, simplifying the evaluation of an algorithm's performance against different datasets. Here's an example snapshot of our data folder structure:

data-folder
├── cluster1           # Collections of media files considered duplicates
│   ├── 0original.mp3  # The first file is the original (lexicographic order)
│   ├── mod1.mp3       # Variation that should match against other files in the cluster
│   ├── mod2_rot20.mp3 # Variation with named transform
│   └── ...            # Create as many modified versions as you like
├── cluster2           # A cluster folder can have any name
│   ├── file1.mp3      # Files in cluster folders can have any name
│   └── ...            # A cluster folder may have only one file (query with no match)
├── sample1.mp3        # Top-level files that should NOT match against any other files
├── sample2.mp3        # The ratio of distractor content vs. cluster content is relevant for metrics
└── ...
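Because the layout above encodes the ground truth implicitly, it can be recovered with a simple directory walk. Here is a hedged sketch of such a loader; the function name and return shape are illustrative assumptions, not TwinSpect's actual interface:

```python
from pathlib import Path

def load_ground_truth(root):
    """Walk a data folder using the layout described above.

    Returns (clusters, distractors):
      - each cluster is a sorted list of file paths; by convention the
        first entry (lexicographic order) is the original,
      - distractors are the top-level files that must not match anything.
    """
    root = Path(root)
    clusters, distractors = [], []
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            files = sorted(p for p in entry.iterdir() if p.is_file())
            if files:
                clusters.append(files)
        elif entry.is_file():
            distractors.append(entry)
    return clusters, distractors
```

The lexicographic-first convention is why originals carry names like `0original.mp3`: the leading `0` sorts them ahead of their variations.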

Evaluation Datasets#

Info

The content of this section is autogenerated from the latest published configuration of the TwinSpect Benchmark.


STLIB-2000#

Dataset Info

  • ID: e20bd16e097e8faf
  • Mode: Text
  • Size: 4.8 GB
  • Files: 1610

The STLIB-2000 is a dataset designed to assess the accuracy of text identification algorithms. It includes ground truth data for a total of 1610 text files with near-duplicates organized into 805 clusters.

The STLIB-2000 is a real-world dataset of 2000 commercial E-Books, where each title has an EPUB and a PDF version. The data has been generously provided by StreetLib. Because the ISCC-SDK does not yet support OCR, titles available only as image-based E-Books were removed before benchmarking.

Clustering Details

Each cluster contains 2 near-duplicate text files.


ISCC-FMA-10K#

Dataset Info

  • ID: 142e3bd331044320
  • Mode: Audio
  • Size: 69.0 GB
  • Files: 10000

The ISCC-FMA-10K is a dataset designed to assess the accuracy of audio identification algorithms. It includes ground truth data for a total of 10000 audio files with near-duplicates organized into 500 clusters.

Additionally, the dataset contains 4500 unique audio files, with no corresponding duplicates within the set.

The ISCC-FMA-10K benchmark is a subset of the Free Music Archive dataset. The subset is generated by collecting 5000 random audio files (each longer than 60 seconds). Additionally, 10 synthetic transformations are applied to a random selection of 500 of those audio files. The TwinSpect benchmark automatically downloads and reproduces the tested dataset.

Clustering Details

Each cluster contains 11 near-duplicate audio files.

Synthetic Transformations

The following transformations were applied to 500 files of the dataset to simulate different conditions that might be encountered in real-world applications:

  • equalize: Equalize audio (ffmpeg equalizer=f=1000:t=o:w=200:g=10)
  • loudnorm: Apply loudness normalization (ffmpeg loudnorm=I=-16:TP=-1.5:LRA=11)
  • fade-8s-both: Fade in/out 8 seconds at start and end
  • trim-5s-both: Remove 5 seconds of audio from start and end
  • transcode-aac-32kbps: Transcode audio to 32kbps AAC
  • transcode-mp3-128kbps: Transcode audio to 128kbps MP3
  • echo: Apply echo effect (ffmpeg aecho=0.8:0.7:60:0.2)
  • trim-1s-both: Remove 1 second of audio from start and end
  • transcode-ogg-64kbps: Transcode audio to 64kbps OGG
  • compress-medium: Apply audio compression (attack 10, release 200, ratio 3, threshold -20)
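Several of the transformations above are expressed directly as ffmpeg audio filters. A minimal sketch of how such command lines could be assembled is shown below; the helper name and structure are illustrative assumptions, not TwinSpect's actual implementation, and running the commands requires ffmpeg to be installed:

```python
def ffmpeg_cmd(src, dst, *filters, codec_args=()):
    """Build (but do not run) an ffmpeg command line for one transformation.

    `filters` are ffmpeg audio-filter strings (joined into one -af chain);
    `codec_args` carries extra encoder options such as a target bitrate.
    """
    cmd = ["ffmpeg", "-y", "-i", src]
    if filters:
        cmd += ["-af", ",".join(filters)]
    cmd += list(codec_args) + [dst]
    return cmd

# Filter strings taken from the transformation list above:
equalize = ffmpeg_cmd("in.mp3", "out_eq.mp3", "equalizer=f=1000:t=o:w=200:g=10")
echo = ffmpeg_cmd("in.mp3", "out_echo.mp3", "aecho=0.8:0.7:60:0.2")
transcode = ffmpeg_cmd("in.mp3", "out.aac", codec_args=("-b:a", "32k"))
```

A built command could then be executed with `subprocess.run(cmd, check=True)` to produce the transformed file.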

MIRFLICKR-MFND#

Dataset Info

  • ID: 71d8a361044c7b5f
  • Mode: Image
  • Size: 520.0 MB
  • Files: 4062

The MIRFLICKR-MFND is a dataset designed to assess the accuracy of image identification algorithms. It includes ground truth data for a total of 4062 image files with near-duplicates organized into 1958 clusters.

The MFND benchmark (Connor et al., 2015) is a subset of the real-world MIRFLICKR dataset (Huiskes & Lew, 2008) with annotations for near duplicates (IND). The TwinSpect benchmark automatically downloads and reproduces the tested dataset.

Clustering Details

Clusters contain an average of 2.07 near-duplicate image files.

Cluster sizes

  • Minimum: 1
  • Maximum: 14
  • Mean: 2.07
  • Median: 2.0

Last update: 2023-07-19
Created: 2023-07-19