This page turns the concrete strength notebook into a step-by-step workflow. It explains the regression problem, shows how the data moves through preprocessing and modeling, and lets you preview evaluation and testing before opening Colab.
Module 4 is a supervised machine-learning workflow. We give the model many past concrete mixtures together with their real measured strength, and the model learns a pattern that can be used later on a new mix design.
What is machine learning here?
In this module, machine learning means using past data to learn a relationship between concrete ingredients, curing age, and final compressive strength.
Input data: cement, slag, fly ash, water, superplasticizer, aggregates, and age
Target value: compressive strength in MPa
Goal: estimate strength for a new recipe
What kind of problem is this?
This is a regression problem, not classification. The model predicts a number, not a category.
Output is continuous: for example 32.4 MPa or 51.8 MPa
The model is trying to get close to the true measured strength
Later we judge the model by prediction error, not by accuracy
What do we train?
We train the model parameters. In linear regression these are the coefficients for each feature. In random forest or XGBoost, these are the learned split rules inside many decision trees.
What happens after training?
After training, we have a fitted model plus evaluation results such as MAE, RMSE, and R2 that tell us how well it works on unseen data.
How do we predict later?
We give the trained model a new concrete mix with the same input columns. The model uses what it learned during training to estimate the strength of that new mix.
Real dataset used in the notebook
The notebook uses the UCI concrete compressive strength dataset with 1030 rows.
There are 8 input features and 1 target.
Strength ranges from about 2.33 to 82.60 MPa.
Average strength is about 35.82 MPa.
1. Past data: concrete recipes plus real measured strength values.
2. Training: the model learns patterns that connect inputs to strength.
3. Fitted model: we keep the learned model after training is finished.
4. New prediction: a new mix goes in and the model returns predicted MPa.
1. Load data: read the CSV and inspect the columns, summary statistics, and correlations.
2. Prepare features: split the target from the predictors and create train and test sets.
3. Train models: fit baseline and stronger regressors such as linear regression and random forest.
4. Evaluate: compare MAE, RMSE, and R2 to see which model generalizes better.
5. Test a new mix: enter a new recipe and preview how the workflow produces a strength estimate.
What the notebook is teaching
How tabular construction data becomes a machine-learning dataset.
Why train/test splitting comes before scaling and tuning.
How simple and advanced models can be compared fairly.
How evaluation metrics translate into model quality.
Data setup
Prepare the dataset and workflow settings
This step mirrors the early notebook cells: load the CSV, inspect the data, separate features from the target, and choose how the train/test split and preprocessing should work.
Workflow settings
What happens in this step
Read Concrete_Data.csv into a dataframe.
Check rows, columns, summary statistics, and correlations.
Separate the target column from the eight input variables.
Create train and test sets before scaling to avoid leakage.
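The steps above can be sketched in a few lines, assuming a pandas/scikit-learn stack. The tiny inline DataFrame stands in for Concrete_Data.csv and uses a few example rows from the UCI dataset with abbreviated column names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for Concrete_Data.csv (the real file has 1030 rows
# and 8 input features); column names are abbreviated here.
df = pd.DataFrame({
    "cement":           [540.0, 540.0, 332.5, 198.6],
    "slag":             [0.0, 0.0, 142.5, 132.4],
    "fly_ash":          [0.0, 0.0, 0.0, 0.0],
    "water":            [162.0, 162.0, 228.0, 192.0],
    "superplasticizer": [2.5, 2.5, 0.0, 0.0],
    "age":              [28, 28, 270, 360],
    "strength":         [79.99, 61.89, 40.27, 44.30],
})

X = df.drop(columns="strength")  # input features
y = df["strength"]               # target in MPa

# Split BEFORE any scaling so the test rows never leak into preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With the full dataset and the same 80/20 split, roughly 824 rows train the model and 206 rows stay unseen for evaluation.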
Example rows from the real dataset
Cement | Slag  | Fly Ash | Water | Superplasticizer | Age | Strength
540.0  | 0.0   | 0.0     | 162.0 | 2.5              | 28  | 79.99
540.0  | 0.0   | 0.0     | 162.0 | 2.5              | 28  | 61.89
332.5  | 142.5 | 0.0     | 228.0 | 0.0              | 270 | 40.27
198.6  | 132.4 | 0.0     | 192.0 | 0.0              | 360 | 44.30
Ingredient quantities are in kg per m3 of mixture; age is in days; strength is in MPa. (Coarse and fine aggregate columns are omitted in this preview.)
Feature columns
Cement, slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age.
Target column
Concrete compressive strength, measured in MPa (megapascals)
Train set: 80%
Test set: 20%
Training rows teach the model. Test rows stay unseen until evaluation.
1. Import libraries
Start with pandas, plotting libraries, and the scikit-learn tools used throughout the workflow.
2. Load and inspect the dataset
These cells load the CSV, preview rows, and inspect summary statistics and correlations.
3. Split features and target
This is where the workflow creates the training and testing partitions.
4. Optional feature scaling
Scaling is shown after the split so the notebook avoids data leakage.
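Leakage-free scaling can be sketched with scikit-learn's StandardScaler; the two-column train and test arrays below are hypothetical stand-ins for the notebook's real split. The key point is that the scaler is fit only on training rows and then reused on the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature arrays standing in for the notebook's real split
# (the two columns might be cement and water, for example).
X_train = np.array([[540.0, 162.0], [332.5, 228.0], [198.6, 192.0]])
X_test  = np.array([[425.0, 186.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled  = scaler.transform(X_test)       # same statistics reused: no leakage
```

Calling fit on the full dataset instead would let test-set statistics influence preprocessing, which is exactly the leakage the notebook avoids.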
Model training
See what training means in machine learning
Training means the model looks at many known examples and adjusts its internal parameters so its predictions get closer to the real strength values. The notebook first fits a simple baseline, then a stronger tree-based model, and optionally a tuned version.
Linear Regression
A baseline model that learns a weighted linear relationship between the inputs and strength.
Simple and interpretable
Good benchmark model
May miss nonlinear behavior
Random Forest
An ensemble of decision trees that usually performs better on structured data with nonlinear patterns.
Captures interactions between variables
Provides feature importance
Strong fit for tabular data
XGBoost
An optional boosted-tree model that often performs very well, but may require extra installation and tuning.
Optional in the notebook
Often improves accuracy
Adds complexity
What gets trained in each model?
Linear Regression: learns one coefficient for each input feature and an intercept.
Random Forest: learns many decision trees and combines their outputs.
XGBoost: learns boosted trees stage by stage to reduce error.
In all cases, training uses the training set where both the inputs and the true strength values are known.
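The difference in what gets trained can be seen directly on synthetic data (the two "ingredient" columns and the strength rule below are invented for illustration): a fitted LinearRegression exposes one coefficient per feature plus an intercept, while a fitted RandomForestRegressor stores a collection of decision trees.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Invented data: two "ingredient" columns and a noisy linear strength rule.
rng = np.random.default_rng(0)
X = rng.uniform([150.0, 120.0], [550.0, 250.0], size=(80, 2))
y = 0.1 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0.0, 2.0, 80)

lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)  # one learned weight per feature + intercept

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(len(rf.estimators_))        # the learned "parameters" are 50 fitted trees
```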
1. Read training rows: the model sees many concrete recipes and their known measured strength.
2. Learn patterns: it adjusts coefficients or tree rules to better connect inputs to strength.
3. Reduce error: the model keeps improving until predictions are closer to the known target values.
4. Keep the fitted model: after training, we keep the learned model and use it later for evaluation and prediction.
What we have after training
A fitted model object, prediction errors on the test set, and plots or tables that help compare models.
What we do not have
We do not have a perfect formula for every possible mix. The model only learns from the patterns present in the training data.
Why testing matters
Testing on unseen mixes helps us check whether the learned model generalizes instead of only memorizing the training set.
Training code
5. Train baseline model
The notebook starts with linear regression so every later result can be compared against a simple baseline.
6. Train advanced model
This block switches to the stronger tree-based model selected above.
7. Optional hyperparameter tuning
RandomizedSearchCV searches over a small parameter space to improve the tree model.
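A sketch of how such a search might look, using a small invented dataset and an illustrative parameter grid (the notebook's actual search space may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Invented regression data so the example is self-contained.
rng = np.random.default_rng(1)
X = rng.uniform(size=(60, 4))
y = X @ np.array([3.0, -1.0, 2.0, 0.5]) + rng.normal(0.0, 0.1, 60)

param_dist = {  # illustrative search space, not the notebook's exact grid
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=5,       # try 5 random combinations
    cv=3,           # 3-fold cross-validation on the training data
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)  # best combination found
```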
Feature importance preview
This preview mirrors the notebook's point that tree models can rank which variables matter most.
How to read this
Higher bars mean the model relied more on that variable.
Age and cement often matter strongly for strength prediction.
Feature importance is most meaningful for tree-based models.
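On fitted tree models, the ranking comes from the feature_importances_ attribute. A minimal sketch with invented data, where one column fully drives the target (the feature names are placeholders, not the real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented data: only the first column drives the target, so its
# importance should dominate. Names are placeholders for illustration.
rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 3))
y = 10.0 * X[:, 0] + rng.normal(0.0, 0.1, 200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["age", "cement", "water"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The importances always sum to 1, so each bar can be read as the share of the model's splits explained by that variable.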
Evaluation
Compare model quality with meaningful regression metrics
Evaluating a regression model is different from evaluating a classifier. Instead of accuracy, this module uses MAE, RMSE, and R2 to measure how close predictions are to the real measured strength values.
Model comparison chart
What the metrics mean
MAE is the average absolute prediction error in MPa.
RMSE penalizes larger mistakes more heavily than MAE.
R2 measures how much variance in strength the model explains.
How to interpret the chart
Lower MAE and RMSE are better.
Higher R2 is better.
The best model is not always the most complex one, though stronger models often win here because they capture more nonlinear patterns.
8. Evaluate the models
These cells calculate MAE, RMSE, and R2 on the held-out test set.
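All three metrics are available in scikit-learn; the true and predicted strengths below are made-up numbers for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true vs. predicted strengths in MPa.
y_true = np.array([32.4, 51.8, 40.3, 61.9])
y_pred = np.array([30.0, 50.0, 43.0, 60.0])

mae  = mean_absolute_error(y_true, y_pred)          # average error in MPa
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes big misses more
r2   = r2_score(y_true, y_pred)                     # share of variance explained
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```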
9. Summarize the results
The notebook compares all trained models in one compact results table.
Testing
Test the workflow on a new concrete mix
Prediction happens after training. We enter a new concrete mix with the same input columns, and the trained model estimates its compressive strength. The browser preview below is a teaching demo, while the real notebook prediction comes from the trained scikit-learn model in Colab.
New mix design
Demo reminder
Demo only: the browser estimate is for understanding the workflow.
The real prediction in the notebook comes from the model fitted on the training data.
The purpose here is to show how a new mix moves into the testing stage.
Prediction preview
Choose a mix and click "Preview prediction" to see how the workflow would report model outputs.
New concrete mix → apply the same input columns → run the trained model → get predicted MPa
10. Predict on new data
This final code block shows how the trained model is used on unseen concrete recipes.
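A minimal end-to-end sketch under simplified assumptions (three features and three training rows instead of the real eight-feature, 1030-row dataset). The essential requirement is that the new mix arrives with the same input columns the model was trained on:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Simplified training data; the real notebook uses all 8 input columns.
train = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6],
    "water":  [162.0, 228.0, 192.0],
    "age":    [28, 270, 360],
})
strength = [79.99, 40.27, 44.30]  # measured strengths in MPa

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train, strength)

# A hypothetical new mix with the SAME input columns.
new_mix = pd.DataFrame({"cement": [300.0], "water": [180.0], "age": [28]})
predicted_mpa = model.predict(new_mix)[0]
print(f"Predicted strength: {predicted_mpa:.1f} MPa")
```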