cskmeans/README.md

182 lines
8.9 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# C# KMeans + SpringBoot3 API
This was a self-challenge during 2024's carnival holiday's weekend, and incremented to become an optional exercise for [an MBA in AI for Businesses](https://exame.com/faculdade/mba/mba-em-inteligencia-artificial-para-negocios) - proof-of-concept maturity.
This repository reimplements KMeans in plain C# in the form of a externally-schedulable job¹, and also a minimal effort to make it a Spring Boot 3 API². This also provides some form of outlier detection.
¹ Deployable via crontab, your internal corporate solution, AWS Batch, Azure Batch, GCP Preemptible VMs, Oracle Burstable Instances, or anything else that runs a commandline at a scheduled time. May require tweaking to fit your needs.<br>
² Deployable in your coporate TomCat, WildFly, Docker, Kubernetes, CloudFlare Workers, AWS Lambda, GCP Cloud Functions, or anything else you use to serve a Java Web API. May require tweaking to fit your needs.
## Example lifecycle
1. Train:
```sh
make
```
It will produce an output like this:
```
dotnet run --configuration Release
Started 2/16/2024 1:13:00AM (0.0136183s)
Loaded 2/16/2024 1:13:10AM (9.8201451s, DatasetLines=2671097)
Shuffled 2/16/2024 1:13:10AM (0.1334194s)
Scaled 2/16/2024 1:13:10AM (0.0696958s)
> TrainCentroids=[0.32,0.49,0.52,0.5,0.56]
SplitTTV 2/16/2024 1:13:13AM (2.7777124s, Train=2569595, Test=48222, Validation=53280)
10x Fits 2/16/2024 1:18:16AM (302.8325479s)
> WSSs=[0.23768,0.19774,0.17236,0.15302,0.13696,0.11922,0.11058,0.102,0.09685,0.09028]
10x Tests 2/16/2024 1:18:16AM (0.1831497s)
> SILs=[0.2277,0.22823,0.21612,0.23022,0.23678,0.24369,0.24482,0.24524,0.2472,0.24839]
> AGGs=[-0.00998,0.03049,0.04376,0.0772,0.09982,0.12447,0.13424,0.14324,0.15035,0.15811]
10x Silhouettes 2/16/2024 1:39:22AM (1266.4811917s)
> OutlierScore(o=1.6, Min=0, Avg=0.4713820048875287, Max=0.8736922864125183)
Scalers=[
[32.5, 42.03755], [-124.49859, -114.1], [0, 6], [1, 366], [0, 86340]
]
Centroids=[
[0.16, 0.62, 0.77, 0.76, 0.72],
[0.57, 0.28, 0.83, 0.76, 0.59],
[0.57, 0.28, 0.83, 0.25, 0.57],
[0.18, 0.61, 0.82, 0.6, 0.25],
[0.17, 0.62, 0.27, 0.25, 0.29],
[0.16, 0.63, 0.24, 0.79, 0.52],
[0.16, 0.62, 0.76, 0.23, 0.65],
[0.17, 0.62, 0.17, 0.3, 0.73],
[0.57, 0.28, 0.27, 0.22, 0.58],
[0.56, 0.29, 0.26, 0.74, 0.72],
[0.56, 0.29, 0.28, 0.69, 0.27]
]
DescaledCentroids=[
[34.026008, -118.0514642, 4.62, 278.4, 62164.799999999996],
[37.936403500000004, -121.5869848, 4.9799999999999995, 278.4, 50940.6],
[37.936403500000004, -121.5869848, 4.9799999999999995, 92.25, 49213.799999999996],
[34.216759, -118.1554501, 4.92, 220, 21585],
[34.1213835, -118.0514642, 1.62, 92.25, 25038.6],
[34.026008, -117.9474783, 1.44, 289.35, 44896.8],
[34.026008, -118.0514642, 4.5600000000000005, 84.95, 56121],
[34.1213835, -118.0514642, 1.02, 110.5, 63028.2],
[37.936403500000004, -121.5869848, 1.62, 81.3, 50077.2],
[37.841028, -121.4829989, 1.56, 271.1, 62164.799999999996],
[37.841028, -121.4829989, 1.6800000000000002, 252.85, 23311.800000000003]
]
Prevalence=[1,1,1,1,1,1,1,1,1,1,1]
CompensatedPrevalence=[2,0,1,1,1,3,4,1,1,1,1]
bestK=11 clusters
WSS=0.09427714334438726
Sil=0.24316338789249595
Agg=0.14888624454810867
Validation 2/16/2024 1:42:01AM (158.4397816s)
All Done! 2/16/2024 1:42:01AM (0.000619s; Total=1740.7518809s)
memusg: peak=3274104
```
2. Run server:
```sh
make runapi
```
3. Visit with the browser: `http://localhost:8080`
4. Submit the request:
| Field | Value |
| ----- | ----- |
| latitude | 34.16449 |
| longitude | -118.15798 |
| date | 2009-01-14 |
| time | 14:15:00 |
```sh
curl -s 'http://localhost:8080/model?latitude=34.16449&longitude=-118.15798&date=2009-01-14&time=14%3A15%3A00' | jq
```
5. See the response:
```json
{
"cluster": {
"id": 6,
"truePrevalence": {
"id": 1,
"label": "1-possible"
},
"compensatedPrevalence": {
"id": 4,
"label": "4-dead"
}
},
"outlierScore": 0.5144809550353892
}
```
## Performance
The dataset contains 2,671,097 lines by 4 columns stored as double (8 bytes), which is at least 85,475,104 bytes (81.5 MiB).
The single-threaded³ C# code performance was evaluated on these systems:
| Hardware | 7900X | MacMini M2 | Dell 3511 i5 | i7-4790 |
| ------------- | ------------------ | --------------- | ----------------- | ------------------ |
| Form factor | Desktop | Desktop | Laptop | Desktop |
| Processor | Ryzen 9 7900X | Apple M2 | Intel i5-1135G7 | Intel i7-4790 |
| Cache L3 | 64MB | 8MB | 8MB | 8MB |
| Max Frequency | 4.70 GHz | 3.48 GHz | 2.40 GHz | 3.60 GHz |
| Max Turbo | 5.70 GHz | 3.48 GHz | 4.20 GHz | 4.00 GHz |
| Storage Type | SSD | SSD | SSD | HDD |
| RAM | 4×32GB @ DDR5-4000 | 16GB @ 6400MT/s | 2×8GB @ DDR4-2666 | 2×8 GB @ DDR3-1366 |
| Kernel | Linux 6.7.4-zen1 | Darwin 23.0.0 | Linux 6.7.0-zen3 | Linux 6.6.8-arch-1 |
Therefore, we should should see some memory busses saturated.
³ There are parallelization paths, and they are explicit by their prefix “10x”, but I believe that in a corporate environment there would be many jobs running in parallel and the predictability of a stable resource allocation would have a greater importance.
### RAM resource
Memory was measured by watching the numbers on the resource monitor on each system. Under Linux, that means `htop` and on Mac that means `Activity Monitor`.
| RAM usage | 7900X | MacMini M2 | Dell 3511 i5 | i7-4790 |
| --------- | ------ | ---------- | ------------ | ------- |
| | 3.2 GB | 1.1 GB⁴ | 3.2 GB | 3.2 GB |
⁴ Apple tries cheating by compressing processes' memory, but it backfires when it needs decompressing data in order to use it.
### Processor resource
These timings are measured by the own program.
| Stage | 7900X | MacMini M2 | Dell 3511 i5 | i7-4790 |
| --------------- | ------------ | ----------- | ------------ | ------------ |
| Started | 0.0145117 | 0.030513 | 0.019999 | 0.0237424 |
| Loaded | 3.5465618 | 6.516409 | 7.7220067 | 8.1680365 |
| Shuffled | 0.0226329 | 0.02862 | 0.0641004 | 0.048993 |
| Scaled | 0.0583177 | 0.070994 | 0.0663303 | 0.0903944 |
| SplitTTV | 0.7055008 | 1.027031 | 1.2220999 | 1.3453164 |
| 10x Fits | 194.6585428 | 237.454974 | 309.4348727 | 405.7263568 |
| 10x Tests | 0.1762073 | 0.203529 | 0.275389 | 0.3948622 |
| 10x Silhouettes | 1395.8565106 | 1499.822873 | 2118.9745899 | 2955.7305509 |
| Validation | 141.7715882 | 151.862509 | 217.2271935 | 299.9685991 |
| All Done! | 0.0005635 | 0.001758 | 0.0006435 | 0.0006692 |
| Total | 1736.8109373 | 1897.01921 | 2655.0072249 | 3671.4975209 |
Therefore, we can confirm that L3 cache size and memory bandwidth are more important than CPU “speed”.
## Innovations
None. This is inherently no innovation, as:
- Math-wise:
- KMeans is an old algorithm, known [since at least 1956](https://stats.stackexchange.com/a/82740);
- WSS (Within-Cluster Sum of Squares) is just a fancy name for a specific kind of variance, which the latter exists [since at least 1923](https://link.springer.com/chapter/10.1007/978-1-4612-6079-0_4);
- Silhouette, the newest of it all, was [proposed in 1987](<https://en.wikipedia.org/wiki/Silhouette_(clustering)>).
- Programming-wise:
- C# is an old programming language, available [since 2002](https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-version-history#c-version-10-1);
- Spring Boot is an old web framework, available [since 2014](https://spring.io/blog/2014/04/01/spring-boot-1-0-ga-released);
- Java is an old programming language, available [since 1996](https://en.wikipedia.org/wiki/Java_version_history#Release_table).
The “youngest” item is 10 years old by the time this line got written. That's no innovation.
If you “innovate” using these technologies in your business, it's just a century worth of technical debts that you are removing from your outworn processes.
## License
The implementation is licensed under MIT-0, which basically means Public Domain.
## Links
- Canonical: https://git.adlerneves.com/adler/cskmeans
- Mirror: https://github.com/adlerosn/cskmeans