Module 4 workflow tool

Machine learning workflow

This page turns the concrete strength notebook into a step-by-step workflow. It explains the regression problem, shows how the data moves through preprocessing and modeling, and lets you preview evaluation and testing before opening Colab.

Workflow stages

Understand what machine learning is doing here

Module 4 is a supervised machine-learning workflow. We give the model many past concrete mixtures together with their real measured strength, and the model learns a pattern that can be used later on a new mix design.

What is machine learning here?

In this module, machine learning means using past data to learn a relationship between concrete ingredients, curing age, and final compressive strength.

  • Input data: cement, slag, fly ash, water, superplasticizer, aggregates, and age
  • Target value: compressive strength in MPa
  • Goal: estimate strength for a new recipe

What kind of problem is this?

This is a regression problem, not classification. The model predicts a number, not a category.

  • Output is continuous: for example 32.4 MPa or 51.8 MPa
  • The model is trying to get close to the true measured strength
  • Later we judge the model by prediction error, not by accuracy

What do we train?

We train the model parameters. In linear regression these are the coefficients for each feature. In random forest or XGBoost, these are the learned split rules inside many decision trees.

What happens after training?

After training, we have a fitted model plus evaluation results such as MAE, RMSE, and R2 that tell us how well it works on unseen data.

How do we predict later?

We give the trained model a new concrete mix with the same input columns. The model uses what it learned during training to estimate the strength of that new mix.

Real dataset used in the notebook

  • The notebook uses the UCI concrete compressive strength dataset with 1030 rows.
  • There are 8 input features and 1 target.
  • Strength ranges from about 2.33 to 82.60 MPa.
  • Average strength is about 35.82 MPa.
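The loading-and-inspection step can be sketched in a few lines. This is a minimal illustration, not the notebook's exact code: the notebook reads the full 1030-row CSV, while here a handful of example rows (with the aggregate columns omitted) stand in so the snippet is self-contained, and the column names are assumed.

```python
import pandas as pd

# In the notebook the full 1030-row file is loaded with
# pd.read_csv("Concrete_Data.csv"); a few example rows stand in here.
df = pd.DataFrame({
    "cement":           [540.0, 540.0, 332.5, 198.6],
    "slag":             [0.0,   0.0,   142.5, 132.4],
    "fly_ash":          [0.0,   0.0,   0.0,   0.0],
    "water":            [162.0, 162.0, 228.0, 192.0],
    "superplasticizer": [2.5,   2.5,   0.0,   0.0],
    "age":              [28,    28,    270,   360],
    "strength":         [79.99, 61.89, 40.27, 44.30],
})

print(df.shape)                   # (rows, columns)
print(df.describe().loc["mean"])  # summary statistics per column
```

On the real file, `df.shape` would report 1030 rows and 9 columns (8 features plus the strength target).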

1

Past data

Concrete recipes plus real measured strength values.

2

Training

The model learns patterns that connect inputs to strength.

3

Fitted model

We keep the learned model after training is finished.

4

New prediction

A new mix goes in and the model returns predicted MPa.

1

Load data

Read the CSV and inspect the columns, summary statistics, and correlations.

2

Prepare features

Split the target from the predictors and create train and test sets.

3

Train models

Fit baseline and stronger regressors such as linear regression and random forest.

4

Evaluate

Compare MAE, RMSE, and R2 to see which model generalizes better.

5

Test a new mix

Enter a new recipe and preview how the workflow produces a strength estimate.

What the notebook is teaching

  • How tabular construction data becomes a machine-learning dataset.
  • Why train/test splitting comes before scaling and tuning.
  • How simple and advanced models can be compared fairly.
  • How evaluation metrics translate into model quality.

Data setup

Prepare the dataset and workflow settings

This step mirrors the early notebook cells: load the CSV, inspect the data, separate features from the target, and choose how the train/test split and preprocessing should work.

Workflow settings

What happens in this step

  • Read Concrete_Data.csv into a dataframe.
  • Check rows, columns, summary statistics, and correlations.
  • Separate the target column from the eight input variables.
  • Create train and test sets before scaling to avoid leakage.
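The feature/target separation and the 80/20 split can be sketched as below. Synthetic data stands in for the real dataframe, and the column names and `random_state` value are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataframe loaded from Concrete_Data.csv.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.uniform(0, 1, size=(100, 9)),
                  columns=["cement", "slag", "fly_ash", "water",
                           "superplasticizer", "coarse_agg", "fine_agg",
                           "age", "strength"])

X = df.drop(columns="strength")   # the eight input features
y = df["strength"]                # the target, in MPa

# 80/20 split; a fixed seed makes the partition reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```

The split happens before any scaling or tuning, so the test rows stay unseen until evaluation.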

Example rows from the real dataset

Cement   Slag    Fly Ash   Water   Superplasticizer   Age   Strength
 540.0     0.0       0.0   162.0                2.5    28      79.99
 540.0     0.0       0.0   162.0                2.5    28      61.89
 332.5   142.5       0.0   228.0                0.0   270      40.27
 198.6   132.4       0.0   192.0                0.0   360      44.30

(Coarse and fine aggregate columns are omitted here for brevity; the full dataset has all eight input features.)

Feature columns

Cement, slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age.

Target column

Concrete compressive strength (MPa, megapascals)

Train set 80%
Test set 20%

Training rows teach the model. Test rows stay unseen until evaluation.

1. Import libraries

Start with pandas, plotting libraries, and the scikit-learn tools used throughout the workflow.
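A typical import cell for this workflow might look like the following. The exact set in the notebook may differ; this sketch covers the tools referenced throughout this page:

```python
# Core data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn pieces used across the workflow
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```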

2. Load and inspect the dataset

These cells load the CSV, preview rows, and inspect summary statistics and correlations.

3. Split features and target

This is where the workflow creates the training and testing partitions.

4. Optional feature scaling

Scaling is shown after the split so the notebook avoids data leakage.
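The leak-free pattern is to fit the scaler on the training rows only and reuse its statistics on the test rows. A minimal sketch with synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and target.
rng = np.random.default_rng(0)
X = rng.uniform(100, 600, size=(100, 8))   # ingredient amounts
y = rng.uniform(2, 82, size=100)           # strengths in MPa

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics

print(X_train_scaled.mean(axis=0).round(4))     # ~0 for every column
```

Fitting the scaler on the full dataset would let test-set statistics leak into training, which is exactly what this ordering avoids.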

Model training

See what training means in machine learning

Training means the model looks at many known examples and adjusts its internal parameters so its predictions get closer to the real strength values. The notebook first fits a simple baseline, then a stronger tree-based model, and optionally a tuned version.

Linear Regression

A baseline model that learns a weighted linear relationship between the inputs and strength.

  • Simple and interpretable
  • Good benchmark model
  • May miss nonlinear behavior

Random Forest

An ensemble of decision trees that usually performs better on structured data with nonlinear patterns.

  • Captures interactions between variables
  • Provides feature importance
  • Strong fit for tabular data

XGBoost

An optional boosted-tree model that often performs very well, but may require extra installation and tuning.

  • Optional in the notebook
  • Often improves accuracy
  • Adds complexity

What gets trained in each model?

  • Linear Regression: learns one coefficient for each input feature and an intercept.
  • Random Forest: learns many decision trees and combines their outputs.
  • XGBoost: learns boosted trees stage by stage to reduce error.
  • In all cases, training uses the training set where both the inputs and the true strength values are known.

1

Read training rows

The model sees many concrete recipes and their known measured strength.

2

Learn patterns

It adjusts coefficients or tree rules to better connect inputs to strength.

3

Reduce error

The model keeps improving until predictions are closer to the known target values.

4

Keep the fitted model

After training, we keep the learned model and use it later for evaluation and prediction.

What we have after training

A fitted model object, prediction errors on the test set, and plots or tables that help compare models.

What we do not have

We do not have a perfect formula for every possible mix. The model only learns from the patterns present in the training data.

Why testing matters

Testing on unseen mixes helps us check whether the learned model generalizes instead of only memorizing the training set.

Training code

5. Train baseline model

The notebook starts with linear regression so every later result can be compared against a simple baseline.

6. Train advanced model

This block switches to the stronger tree-based model selected above.
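Fitting the baseline and the tree-based model follows the same two-line pattern in scikit-learn. This sketch uses synthetic stand-in data, and the hyperparameter values are assumptions, not the notebook's exact settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real training split.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, size=(200, 8))
y_train = X_train @ rng.uniform(1, 10, size=8) + rng.normal(0, 0.5, 200)

# Baseline: one coefficient per feature plus an intercept.
lin = LinearRegression().fit(X_train, y_train)

# Stronger model: an ensemble of decision trees.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(lin.coef_.shape)      # (8,) — one weight per input feature
print(len(rf.estimators_))  # 100 fitted trees
```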

7. Optional hyperparameter tuning

RandomizedSearchCV searches over a small parameter space to improve the tree model.
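A RandomizedSearchCV call of the kind described might look like this. The parameter grid, `n_iter`, and scoring choice are illustrative assumptions, and synthetic data stands in for the training split:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real training split.
rng = np.random.default_rng(2)
X_train = rng.uniform(0, 1, size=(120, 8))
y_train = X_train.sum(axis=1) + rng.normal(0, 0.1, 120)

# A small, assumed parameter space for the tree model.
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=5,                             # sample 5 random combinations
    cv=3,                                 # 3-fold cross-validation
    scoring="neg_mean_absolute_error",    # lower MAE is better
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```

The best estimator found by the search can then be evaluated on the held-out test set like any other model.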

Feature importance preview

This visual mimics the notebook idea that tree models can rank which variables matter most.

How to read this

  • Higher bars mean the model relied more on that variable.
  • Age and cement often matter strongly for strength prediction.
  • Feature importance is most meaningful for tree-based models.
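Extracting and ranking importances from a fitted tree model can be sketched as follows. The data here is synthetic, deliberately constructed so that two features drive the target; in the notebook the importances come from the forest fitted on the real data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["cement", "slag", "fly_ash", "water",
                 "superplasticizer", "coarse_agg", "fine_agg", "age"]

# Synthetic data where cement (col 0) and age (col 7) drive the target.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(300, 8))
y = 5 * X[:, 0] + 3 * X[:, 7] + rng.normal(0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; higher means the model relied on it more.
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:18s} {score:.3f}")
```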

Evaluation

Compare model quality with meaningful regression metrics

Regression evaluation is different from classification. Instead of accuracy, this module uses MAE, RMSE, and R2 to compare how close predictions are to real measured strength values.

Model comparison chart

What the metrics mean

  • MAE is the average absolute prediction error in MPa.
  • RMSE penalizes larger mistakes more heavily than MAE.
  • R2 measures how much variance in strength the model explains.
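Computing the three metrics is a few lines once true and predicted values are in hand. The numbers below are illustrative, not notebook results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([32.4, 51.8, 40.3, 44.3])   # measured strengths, MPa
y_pred = np.array([30.0, 55.0, 41.0, 43.0])   # illustrative predictions

mae = mean_absolute_error(y_true, y_pred)     # average |error| in MPa
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes big misses
r2 = r2_score(y_true, y_pred)                 # share of variance explained

print(f"MAE={mae:.2f} MPa, RMSE={rmse:.2f} MPa, R2={r2:.3f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions; a large gap between the two signals a few big misses.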

How to interpret the chart

  • Lower MAE and RMSE are better.
  • Higher R2 is better.
  • A better model is not always the most complex one, but it often captures more nonlinear patterns.

8. Evaluate the models

These cells calculate MAE, RMSE, and R2 on the held-out test set.

9. Summarize the results

The notebook compares all trained models in one compact results table.

Testing

Test the workflow on a new concrete mix

Prediction happens after training. We enter a new concrete mix with the same input columns, and the trained model estimates its compressive strength. The browser preview below is a teaching demo, while the real notebook prediction comes from the trained scikit-learn model in Colab.

New mix design

Demo reminder

  • Demo only: the browser estimate is for understanding the workflow.
  • The real prediction in the notebook comes from the model fitted on the training data.
  • The purpose here is to show how a new mix moves into the testing stage.

Prediction preview

Choose a mix and click "Preview prediction" to see how the workflow would report model outputs.

New concrete mix → Apply same input columns → Run trained model → Get predicted MPa

10. Predict on new data

This final code block shows how the trained model is used on unseen concrete recipes.
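The key point of this step is that the new mix must arrive with the same eight input columns, in the same order, as the training data. A minimal sketch, in which a model fitted on synthetic stand-in data plays the role of the notebook's trained model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

cols = ["cement", "slag", "fly_ash", "water",
        "superplasticizer", "coarse_agg", "fine_agg", "age"]

# Synthetic stand-in training data; in the notebook, `model` is the
# regressor already fitted on the real training split.
rng = np.random.default_rng(4)
X_train = pd.DataFrame(rng.uniform(0, 1, size=(200, 8)), columns=cols)
y_train = X_train["cement"] * 50 + X_train["age"] * 30 + rng.normal(0, 1, 200)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# A new mix: same eight columns, same order, one row (values illustrative).
new_mix = pd.DataFrame([[0.5, 0.1, 0.0, 0.3, 0.02, 0.9, 0.7, 0.25]],
                       columns=cols)
predicted_mpa = model.predict(new_mix)[0]
print(f"Predicted strength: {predicted_mpa:.1f} MPa")
```

If the columns are renamed, reordered, or missing, the prediction is either an error or silently wrong, which is why the workflow emphasizes "apply same input columns" before running the trained model.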