INSTANCEMATCHING.ORG
How to discover similar objects in heterogeneous data sources
A minor flaw in current IIMB 2009 version
Written by Christian Meilicke   
Friday, 11 December 2009 08:23

The current IIMB 2009 version has some really nice features. In particular, the stronger T-Box makes it possible to check to what extent systems can exploit this additional information. However, while running experiments with the dataset we noticed a minor characteristic that might distort the results. Because the dataset is generated synthetically, the ids for the different versions follow this pattern:

actor_453896_0
actor_453896_1
actor_453896_2
actor_453896_3
actor_453896_4
...
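
To illustrate how much the pattern gives away, here is a minimal Python sketch. It assumes that every id has the shape shown in the listing above (a stem followed by a trailing number); the function name and the sample id `actor_12_0` are made up for illustration and are not part of IIMB. Stripping the trailing number is enough to group the ids without looking at a single property value.

```python
import re
from collections import defaultdict

# Assumed id shape, taken from the listing above: <stem>_<number>_<trailing number>
ID_PATTERN = re.compile(r"^(?P<base>.+_\d+)_(?P<version>\d+)$")


def group_by_base_id(ids):
    """Group ids that differ only in the trailing number."""
    groups = defaultdict(list)
    for instance_id in ids:
        match = ID_PATTERN.match(instance_id)
        if match:  # ids that do not follow the assumed pattern are skipped
            groups[match.group("base")].append(instance_id)
    return dict(groups)


# "actor_12_0" is an invented id, added only to show a second group.
ids = ["actor_453896_0", "actor_453896_1", "actor_453896_2", "actor_12_0"]
groups = group_by_base_id(ids)
```

A system does not need to do this on purpose: any string similarity measure applied to the ids will score such pairs very highly.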

Many systems use not only datatype property values and labels as sources of information during the matching process, but also the ids of instances and classes. Especially in terminological matching (and many systems have started doing this), the id often also serves as a description, even though this is bad modelling style. Each of these systems will have a great advantage over the others, one that is based not on a better strategy but on this peculiarity of the dataset. Thus, it would be better to, for example, randomize the ids in some way.
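
One possible way to randomize the ids is sketched below in Python. This is an illustration only, not the procedure used to generate IIMB. The function names, the neutral `item` prefix and the representation of the reference alignment as pairs of ids are all assumptions. Each original id is replaced by an opaque random one, and the same mapping is applied to the reference alignment so that the expected matches stay intact.

```python
import secrets


def build_id_mapping(ids, prefix="item"):
    """Map every original id to a fresh opaque id (hypothetical helper)."""
    mapping = {}
    used = set()
    for original in ids:
        if original in mapping:
            continue  # each original id is mapped exactly once
        new_id = f"{prefix}_{secrets.token_hex(8)}"
        while new_id in used:  # a collision is very unlikely; retry if it happens
            new_id = f"{prefix}_{secrets.token_hex(8)}"
        used.add(new_id)
        mapping[original] = new_id
    return mapping


def rewrite_alignment(pairs, mapping):
    """Apply the mapping to a reference alignment given as (id, id) pairs."""
    return [(mapping[a], mapping[b]) for a, b in pairs]


ids = ["actor_453896_0", "actor_453896_1", "actor_453896_2"]
mapping = build_id_mapping(ids)

# Assumed format of the reference alignment: a list of id pairs.
reference = [("actor_453896_0", "actor_453896_1")]
anonymized_reference = rewrite_alignment(reference, mapping)
```

With ids like these, a system has to rely on labels, property values and the T-Box to find the matches.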

Last Updated on Friday, 11 December 2009 11:59
 

Comments  

 
#1 Alfio Ferrara 2009-12-11 12:06
As Christian noticed, the instance id "anonymizing" work has been done only for movies. This is a hint for the next version.