The enormous amount of structured and unstructured data produced in many fields has led to the era of big data. Existing mining algorithms cannot process data at this scale effectively, so data reduction techniques are commonly applied before the data mining algorithms. Instance selection is a promising reduction technique that reduces the size of the training dataset by selecting its most relevant instances. However, traditional instance selection methods scale poorly to large datasets because of memory limitations. Recent approaches partition the training dataset into subsets and apply instance selection methods to each subset individually. Most of these approaches rely on random partitioning, which degrades the performance of the instance selection methods, especially when the number of subsets is high. In this work, we propose a new partitioning approach called automated overlapped distance-based partitioning. Our approach assigns instances to subsets according to distance, and an instance can be assigned to two subsets based on a defined threshold. We implement the proposed approach and evaluate it experimentally on six standard datasets with the condensed nearest neighbor (CNN) method, a standard condensation method for instance selection. The results demonstrate that our approach outperforms current random partitioning approaches in terms of reduction rate and effectiveness. Moreover, our approach maintains a high reduction rate and effectiveness as the number of subsets increases.
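The pipeline described in the abstract (distance-based assignment, overlap controlled by a threshold, then condensation per subset) can be illustrated with a minimal Python sketch. The abstract does not specify how reference points are chosen or how the threshold is applied, so everything concrete here is an assumption made for illustration: the function names `overlapped_partition` and `cnn_condense`, the fixed `centers`, the Euclidean metric, and the rule that an instance joins a second subset when its two nearest centers are within `threshold` of each other. The condensation step follows the general idea of Hart's condensed nearest neighbor rule and is not claimed to match the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def overlapped_partition(points, centers, threshold):
    """Assign each instance index to the subset of its nearest center.

    Assumption (not from the paper): an instance is also placed in the
    subset of its second-nearest center when the gap between the two
    distances is at most `threshold`, so subsets may overlap.
    """
    subsets = [[] for _ in centers]
    for i, p in enumerate(points):
        dists = sorted((euclidean(p, c), k) for k, c in enumerate(centers))
        subsets[dists[0][1]].append(i)
        if len(dists) > 1 and dists[1][0] - dists[0][0] <= threshold:
            subsets[dists[1][1]].append(i)
    return subsets


def cnn_condense(points, labels, indices):
    """Condensed nearest neighbor on one subset, returning kept indices.

    Starts from the first instance and adds every instance that the
    current store misclassifies with 1-NN, repeating until a full pass
    adds nothing.
    """
    if not indices:
        return []
    store = [indices[0]]
    changed = True
    while changed:
        changed = False
        for i in indices:
            if i in store:
                continue
            nearest = min(store, key=lambda j: euclidean(points[i], points[j]))
            if labels[nearest] != labels[i]:
                store.append(i)
                changed = True
    return store


# Toy data (hypothetical): two classes and one borderline instance.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.1, 0.9), (0.5, 0.5)]
labels = [0, 0, 1, 1, 0]
centers = [(0.0, 0.0), (1.0, 1.0)]  # assumed reference points

subsets = overlapped_partition(points, centers, threshold=0.1)
# Condense each subset independently, then merge the kept indices.
selected = sorted({i for s in subsets for i in cnn_condense(points, labels, s)})
print(subsets, selected)
```

In this toy run the borderline instance is equidistant from both centers, so it lands in both subsets, which is the kind of overlap the abstract describes. A random partitioning would instead split instances without regard to their neighbors, which is the behavior the abstract identifies as harmful when the number of subsets grows.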
A New Automated Big Data Partitioning Approach to Improve Condensation Methods Performance
M. Malhat,M. El-Menshawy,Hamdy M. Mousa,Ashraf El-Sisi
Published 2018 in International Computer Engineering Conference
PUBLICATION RECORD
- Publication year: 2018
- Venue: International Computer Engineering Conference
- Publication date: 2018-12-01
- Fields of study: Computer Science
- Source metadata: Semantic Scholar