# C# KMeans + SpringBoot3 API

This started as a self-challenge over the 2024 Carnival holiday weekend and later grew into an optional exercise for [an MBA in AI for Businesses](https://exame.com/faculdade/mba/mba-em-inteligencia-artificial-para-negocios). It is at proof-of-concept maturity.

This repository reimplements KMeans in plain C# as an externally schedulable job¹, and also makes a minimal effort to expose it as a Spring Boot 3 API². It also provides some form of outlier detection.

¹ Deployable via crontab, your internal corporate solution, AWS Batch, Azure Batch, GCP Preemptible VMs, Oracle Burstable Instances, or anything else that runs a command line at a scheduled time. May require tweaking to fit your needs.

² Deployable in your corporate Tomcat, WildFly, Docker, Kubernetes, Cloudflare Workers, AWS Lambda, GCP Cloud Functions, or anything else you use to serve a Java Web API. May require tweaking to fit your needs.

## Example lifecycle

1. Train:
   ```sh
   make
   ```
   It will produce an output like this:
   ```
   dotnet run --configuration Release
   Started 2/16/2024 1:13:00 AM (0.0136183s)
   Loaded 2/16/2024 1:13:10 AM (9.8201451s, DatasetLines=2671097)
   Shuffled 2/16/2024 1:13:10 AM (0.1334194s)
   Scaled 2/16/2024 1:13:10 AM (0.0696958s)
   > TrainCentroids=[0.32,0.49,0.52,0.5,0.56]
   SplitTTV 2/16/2024 1:13:13 AM (2.7777124s, Train=2569595, Test=48222, Validation=53280)
   10x Fits 2/16/2024 1:18:16 AM (302.8325479s)
   > WSSs=[0.23768,0.19774,0.17236,0.15302,0.13696,0.11922,0.11058,0.102,0.09685,0.09028]
   10x Tests 2/16/2024 1:18:16 AM (0.1831497s)
   > SILs=[0.2277,0.22823,0.21612,0.23022,0.23678,0.24369,0.24482,0.24524,0.2472,0.24839]
   > AGGs=[-0.00998,0.03049,0.04376,0.0772,0.09982,0.12447,0.13424,0.14324,0.15035,0.15811]
   10x Silhouettes 2/16/2024 1:39:22 AM (1266.4811917s)
   > OutlierScore(o=1.6, Min=0, Avg=0.4713820048875287, Max=0.8736922864125183)
   Scalers=[
     [32.5, 42.03755],
     [-124.49859, -114.1],
     [0, 6],
     [1, 366],
     [0, 86340]
   ]
   Centroids=[
     [0.16, 0.62, 0.77, 0.76, 0.72],
     [0.57, 0.28, 0.83, 0.76, 0.59],
     [0.57, 0.28, 0.83, 0.25, 0.57],
     [0.18, 0.61, 0.82, 0.6, 0.25],
     [0.17, 0.62, 0.27, 0.25, 0.29],
     [0.16, 0.63, 0.24, 0.79, 0.52],
     [0.16, 0.62, 0.76, 0.23, 0.65],
     [0.17, 0.62, 0.17, 0.3, 0.73],
     [0.57, 0.28, 0.27, 0.22, 0.58],
     [0.56, 0.29, 0.26, 0.74, 0.72],
     [0.56, 0.29, 0.28, 0.69, 0.27]
   ]
   DescaledCentroids=[
     [34.026008, -118.0514642, 4.62, 278.4, 62164.799999999996],
     [37.936403500000004, -121.5869848, 4.9799999999999995, 278.4, 50940.6],
     [37.936403500000004, -121.5869848, 4.9799999999999995, 92.25, 49213.799999999996],
     [34.216759, -118.1554501, 4.92, 220, 21585],
     [34.1213835, -118.0514642, 1.62, 92.25, 25038.6],
     [34.026008, -117.9474783, 1.44, 289.35, 44896.8],
     [34.026008, -118.0514642, 4.5600000000000005, 84.95, 56121],
     [34.1213835, -118.0514642, 1.02, 110.5, 63028.2],
     [37.936403500000004, -121.5869848, 1.62, 81.3, 50077.2],
     [37.841028, -121.4829989, 1.56, 271.1, 62164.799999999996],
     [37.841028, -121.4829989, 1.6800000000000002, 252.85, 23311.800000000003]
   ]
   Prevalence=[1,1,1,1,1,1,1,1,1,1,1]
   CompensatedPrevalence=[2,0,1,1,1,3,4,1,1,1,1]
   bestK=11 clusters WSS=0.09427714334438726 Sil=0.24316338789249595 Agg=0.14888624454810867
   Validation 2/16/2024 1:42:01 AM (158.4397816s)
   All Done! 2/16/2024 1:42:01 AM (0.000619s; Total=1740.7518809s)
   memusg: peak=3274104
   ```
2. Run server:
   ```sh
   make runapi
   ```
3. Visit with the browser: `http://localhost:8080`
4. Submit the request:

   | Field     | Value      |
   | --------- | ---------- |
   | latitude  | 34.16449   |
   | longitude | -118.15798 |
   | date      | 2009-01-14 |
   | time      | 14:15:00   |

   ```sh
   curl -s 'http://localhost:8080/model?latitude=34.16449&longitude=-118.15798&date=2009-01-14&time=14%3A15%3A00' | jq
   ```
5. See the response:
   ```json
   {
     "cluster": {
       "id": 6,
       "truePrevalence": {
         "id": 1,
         "label": "1-possible"
       },
       "compensatedPrevalence": {
         "id": 4,
         "label": "4-dead"
       }
     },
     "outlierScore": 0.5144809550353892
   }
   ```

## Performance

The dataset contains 2,671,097 rows by 4 columns stored as doubles (8 bytes each), which amounts to at least 85,475,104 bytes (81.5 MiB).

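The lower bound on the dataset size is plain arithmetic. As a minimal sketch in Java (the class and method names are made up for illustration and are not part of this repository), it can be reproduced like this:

```java
import java.util.Locale;

// Hypothetical helper, not part of the repository: reproduces the size estimate from the text.
class DatasetFootprint {
    /** Bytes needed for rows x columns values stored as 8-byte doubles, ignoring any runtime overhead. */
    public static long rawBytes(long rows, int columns) {
        return rows * columns * Double.BYTES;
    }

    /** Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes). */
    public static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long bytes = rawBytes(2_671_097L, 4);
        // Locale.ROOT keeps the decimal separator a dot regardless of system settings.
        System.out.println(String.format(Locale.ROOT, "%d bytes (%.1f MiB)", bytes, toMiB(bytes)));
    }
}
```

This counts only the raw values, which is why the figure is a lower bound: whatever the runtime allocates on top of it is not included.
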
The single-threaded³ C# code performance was evaluated on these systems:

| Hardware      | 7900X              | MacMini M2      | Dell 3511 i5      | i7-4790            |
| ------------- | ------------------ | --------------- | ----------------- | ------------------ |
| Form factor   | Desktop            | Desktop         | Laptop            | Desktop            |
| Processor     | Ryzen 9 7900X      | Apple M2        | Intel i5-1135G7   | Intel i7-4790      |
| L3 cache      | 64MB               | 8MB             | 8MB               | 8MB                |
| Max Frequency | 4.70 GHz           | 3.48 GHz        | 2.40 GHz          | 3.60 GHz           |
| Max Turbo     | 5.70 GHz           | 3.48 GHz        | 4.20 GHz          | 4.00 GHz           |
| Storage Type  | SSD                | SSD             | SSD               | HDD                |
| RAM           | 4×32GB @ DDR5-4000 | 16GB @ 6400MT/s | 2×8GB @ DDR4-2666 | 2×8 GB @ DDR3-1366 |
| Kernel        | Linux 6.7.4-zen1   | Darwin 23.0.0   | Linux 6.7.0-zen3  | Linux 6.6.8-arch-1 |

Since the dataset does not fit in any of these L3 caches, we should see some memory buses saturated.

³ There are parallelization paths, made explicit by the “10x” prefix, but I believe that in a corporate environment there would be many jobs running in parallel, and the predictability of a stable resource allocation would matter more.

### RAM resource

Memory was measured by watching the numbers on the resource monitor of each system. Under Linux, that means `htop`, and on Mac that means `Activity Monitor`.

| RAM usage | 7900X  | MacMini M2 | Dell 3511 i5 | i7-4790 |
| --------- | ------ | ---------- | ------------ | ------- |
|           | 3.2 GB | 1.1 GB⁴    | 3.2 GB       | 3.2 GB  |

⁴ Apple tries to cheat by compressing processes' memory, but it backfires when the data needs decompressing before it can be used.

### Processor resource

These timings, in seconds, are measured by the program itself.


| Stage           | 7900X        | MacMini M2  | Dell 3511 i5 | i7-4790      |
| --------------- | ------------ | ----------- | ------------ | ------------ |
| Started         | 0.0145117    | 0.030513    | 0.019999     | 0.0237424    |
| Loaded          | 3.5465618    | 6.516409    | 7.7220067    | 8.1680365    |
| Shuffled        | 0.0226329    | 0.02862     | 0.0641004    | 0.048993     |
| Scaled          | 0.0583177    | 0.070994    | 0.0663303    | 0.0903944    |
| SplitTTV        | 0.7055008    | 1.027031    | 1.2220999    | 1.3453164    |
| 10x Fits        | 194.6585428  | 237.454974  | 309.4348727  | 405.7263568  |
| 10x Tests       | 0.1762073    | 0.203529    | 0.275389     | 0.3948622    |
| 10x Silhouettes | 1395.8565106 | 1499.822873 | 2118.9745899 | 2955.7305509 |
| Validation      | 141.7715882  | 151.862509  | 217.2271935  | 299.9685991  |
| All Done!       | 0.0005635    | 0.001758    | 0.0006435    | 0.0006692    |
| Total           | 1736.8109373 | 1897.01921  | 2655.0072249 | 3671.4975209 |

Therefore, we can confirm that L3 cache size and memory bandwidth are more important than CPU “speed”.

## Innovations

None. This is inherently no innovation, as:

- Math-wise:
  - KMeans is an old algorithm, known [since at least 1956](https://stats.stackexchange.com/a/82740);
  - WSS (Within-Cluster Sum of Squares) is just a fancy name for a specific kind of variance, a concept that has existed [since at least 1923](https://link.springer.com/chapter/10.1007/978-1-4612-6079-0_4);
  - Silhouette, the newest of them all, was [proposed in 1987]().
- Programming-wise:
  - C# is an old programming language, available [since 2002](https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-version-history#c-version-10-1);
  - Spring Boot is an old web framework, available [since 2014](https://spring.io/blog/2014/04/01/spring-boot-1-0-ga-released);
  - Java is an old programming language, available [since 1996](https://en.wikipedia.org/wiki/Java_version_history#Release_table).

The “youngest” item was 10 years old at the time this line was written. That's no innovation.
If you “innovate” using these technologies in your business, you are just removing a century's worth of technical debt from your outworn processes.

## License

The implementation is licensed under MIT-0, which basically means Public Domain.

## Links

- Canonical: https://git.adlerneves.com/adler/cskmeans
- Mirror: https://github.com/adlerosn/cskmeans