Authors: Georgiades, Michael; Hussain, Faisal; Christodoulou, Lakis; Dr. HO Kin-Hon, Roy; Hou, Yun; Gregoriades, Andreas
Dates: 2025-11-24; 2025-11-24; 2025
Citation: Georgiades, M., Hussain, F., Christodoulou, L., Ho, K. H., Hou, Y., & Gregoriades, A. (2025). Scalable intrusion detection in IoT networks: Evaluating PySpark pipelines and design trade-offs. In IEEE (Ed.). 2025 21st International conference on distributed computing in smart systems and the internet of things (DCOSS-IoT). 2025 21st International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), Lucca, Italy (pp. 1-8). IEEE.
ISBN: 9798331543723; 9798331543730
ISSN: 2325-2944; 2325-2936
URI: http://hdl.handle.net/20.500.11861/26180
Abstract: The rapid growth of Internet of Things (IoT) networks has introduced challenges in securing large-scale, real-time environments against evolving cyber threats. This study evaluates scalable machine learning workflows implemented in PySpark for intrusion detection using the RT-IoT2022 dataset. We compare manual feature engineering with automated pipeline-based approaches across classifiers including Logistic Regression, Naïve Bayes, Decision Tree, and Random Forest. Leveraging PySpark's distributed processing and modular components (such as Pipeline, StringIndexer, VectorAssembler, and MinMaxScaler), we assess how workflow design affects performance metrics (Accuracy, Precision, Recall, and F1 Score), execution time, and model interpretability. Our findings reveal trade-offs between modularity, transparency, and latency, highlighting the need to align workflow architecture with deployment goals. The results provide practical insights for designing explainable, scalable, and resource-aware intrusion detection systems for real-time IoT security.
Language: en
Keywords: Internet of Things; Intrusion Detection; Internet of Things Networks; Resilient Distributed Dataset; Logistic Regression; Machine Learning; Random Forest; Decision Tree; Performance Metrics; F1 Score; Modularity; Model Interpretation; Distribution Process; Real-Time Environment; Regression Forest; Intrusion Detection System; Internet of Things Security; Deep Learning; Big Data; Distributed Denial of Service; Concept Drift; Scalable Framework; Pipeline Stages; Detection Accuracy; Blended Learning; Apache Spark; Internet of Things Systems; Anomaly Detection; Real-Time Detection
Title: Scalable intrusion detection in IoT networks: Evaluating PySpark pipelines and design trade-offs
Type: Conference Paper
DOI: 10.1109/DCOSS-IoT65416.2025.00119