Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop

C. Sreedhar, N. Kasiviswanath, P. C. Reddy

Published 2017 in Journal of Big Data

ABSTRACT

Big data has become popular for processing, storing and managing massive volumes of data. The clustering of datasets has become a challenging issue in the field of big data analytics. The K-means algorithm is best suited for finding similarities between entities based on distance measures on small datasets. Existing clustering algorithms require scalable solutions to manage large datasets. This study presents two approaches to the clustering of large datasets using MapReduce. The first approach, K-Means Hadoop MapReduce (KM-HMR), focuses on the MapReduce implementation of standard K-means. The second approach, K-means modified inter and intra clustering (KM-I2C), enhances the quality of clusters to produce clusters with minimum intra-cluster and maximum inter-cluster distances for large datasets. The results of the proposed approaches show significant improvements in the efficiency of clustering in terms of execution times. Experiments conducted on standard K-means and the proposed solutions show that the KM-I2C approach is both effective and efficient.
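
As an illustration of the idea the abstract describes, the following is a minimal single-process sketch of K-means written in a map/reduce style, together with simple intra-cluster and inter-cluster distance measures. It is not the authors' KM-HMR or KM-I2C implementation and does not use Hadoop. All function names (`map_phase`, `reduce_phase`, `kmeans`, `intra_cluster_distance`, `inter_cluster_distance`) are hypothetical, and the choice of Euclidean distance and of these particular quality measures is an assumption made for the example.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_phase(points, centroids):
    """Mapper (hypothetical): emit (cluster_index, point) for every input point."""
    for p in points:
        yield nearest(p, centroids), p


def reduce_phase(pairs, centroids):
    """Reducer (hypothetical): average each cluster's points into a new centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # a cluster with no points keeps its old centroid
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iter=20):
    """Repeat map and reduce until the centroids stop changing or max_iter is hit."""
    centroids = list(centroids)
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


def intra_cluster_distance(points, centroids):
    """Mean distance from each point to its nearest centroid (assumed measure)."""
    total = sum(math.dist(p, centroids[nearest(p, centroids)]) for p in points)
    return total / len(points)


def inter_cluster_distance(centroids):
    """Smallest distance between any two distinct centroids (assumed measure)."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(centroids)
        for b in centroids[i + 1:]
    )
```

In a real Hadoop job the mapper and reducer would run on separate nodes and each iteration would be a separate job reading the current centroids. Here both phases run in one process so that the data flow of a single iteration is easy to follow.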

PUBLICATION RECORD

Publication year: 2017
Venue: Journal of Big Data
Publication date: 2017-12-01
Fields of study: Computer Science
Identifiers: DOI 10.1186/s40537-017-0087-2
