Module 4 workflow tool

Machine learning workflow

This page turns the concrete strength notebook into a step-by-step workflow. It explains the regression problem, shows how the data moves through preprocessing and modeling, and lets you preview evaluation and testing before opening Colab.

Workflow stages

Understand what machine learning is doing here

Module 4 is a supervised machine-learning workflow. We give the model many past concrete mixtures together with their real measured strength, and the model learns a pattern that can be used later on a new mix design.

What is machine learning here?

In this module, machine learning means using past data to learn a relationship between concrete ingredients, curing age, and final compressive strength.

  • Input data: cement, slag, fly ash, water, superplasticizer, aggregates, and age
  • Target value: compressive strength in MPa
  • Goal: estimate strength for a new recipe

What kind of problem is this?

This is a regression problem, not classification. The model predicts a number, not a category.

  • Output is continuous: for example 32.4 MPa or 51.8 MPa
  • The model is trying to get close to the true measured strength
  • Later we judge the model by prediction error, not by accuracy

What do we train?

We train the model parameters. In linear regression these are the coefficients for each feature. In random forest or XGBoost, these are the learned split rules inside many decision trees.

What happens after training?

After training, we have a fitted model plus evaluation results such as MAE, RMSE, and R2 that tell us how well it works on unseen data.

How do we predict later?

We give the trained model a new concrete mix with the same input columns. The model uses what it learned during training to estimate the strength of that new mix.

Real dataset used in the notebook

  • The notebook uses the UCI concrete compressive strength dataset with 1030 rows.
  • There are 8 input features and 1 target.
  • Strength ranges from about 2.33 to 82.60 MPa.
  • Average strength is about 35.82 MPa.
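The loading-and-inspection step can be sketched in a few lines. This is a minimal illustration, not the notebook's exact code: the notebook reads the full 1030-row CSV, while here a handful of example rows (with the aggregate columns omitted) stand in so the snippet is self-contained, and the column names are assumed.

```python
import pandas as pd

# In the notebook the full 1030-row file is loaded with
# pd.read_csv("Concrete_Data.csv"); a few example rows stand in here.
df = pd.DataFrame({
    "cement":           [540.0, 540.0, 332.5, 198.6],
    "slag":             [0.0,   0.0,   142.5, 132.4],
    "fly_ash":          [0.0,   0.0,   0.0,   0.0],
    "water":            [162.0, 162.0, 228.0, 192.0],
    "superplasticizer": [2.5,   2.5,   0.0,   0.0],
    "age":              [28,    28,    270,   360],
    "strength":         [79.99, 61.89, 40.27, 44.30],
})

print(df.shape)                   # (rows, columns)
print(df.describe().loc["mean"])  # summary statistics per column
```

On the real file, `df.shape` would report 1030 rows and 9 columns (8 features plus the strength target).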

1

Past data

Concrete recipes plus real measured strength values.

2

Training

The model learns patterns that connect inputs to strength.

3

Fitted model

We keep the learned model after training is finished.

4

New prediction

A new mix goes in and the model returns predicted MPa.

1

Load data

Read the CSV and inspect the columns, summary statistics, and correlations.

2

Prepare features

Split the target from the predictors and create train and test sets.

3

Train models

Fit baseline and stronger regressors such as linear regression and random forest.

4

Evaluate

Compare MAE, RMSE, and R2 to see which model generalizes better.

5

Test a new mix

Enter a new recipe and preview how the workflow produces a strength estimate.

What the notebook is teaching

  • How tabular construction data becomes a machine-learning dataset.
  • Why train/test splitting comes before scaling and tuning.
  • How simple and advanced models can be compared fairly.
  • How evaluation metrics translate into model quality.

Data setup

Prepare the dataset and workflow settings

This step mirrors the early notebook cells: load the CSV, inspect the data, separate features from the target, and choose how the train/test split and preprocessing should work.

Workflow settings

What happens in this step

  • Read Concrete_Data.csv into a dataframe.
  • Check rows, columns, summary statistics, and correlations.
  • Separate the target column from the eight input variables.
  • Create train and test sets before scaling to avoid leakage.
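The feature/target separation and the 80/20 split can be sketched as below. Synthetic data stands in for the real dataframe, and the column names and `random_state` value are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataframe loaded from Concrete_Data.csv.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.uniform(0, 1, size=(100, 9)),
                  columns=["cement", "slag", "fly_ash", "water",
                           "superplasticizer", "coarse_agg", "fine_agg",
                           "age", "strength"])

X = df.drop(columns="strength")   # the eight input features
y = df["strength"]                # the target, in MPa

# 80/20 split; a fixed seed makes the partition reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```

The split happens before any scaling or tuning, so the test rows stay unseen until evaluation.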

Example rows from the real dataset

Cement   Slag    Fly Ash   Water   Superplasticizer   Age   Strength
 540.0     0.0       0.0   162.0                2.5    28      79.99
 540.0     0.0       0.0   162.0                2.5    28      61.89
 332.5   142.5       0.0   228.0                0.0   270      40.27
 198.6   132.4       0.0   192.0                0.0   360      44.30

(Coarse and fine aggregate columns are omitted here for brevity; the full dataset has all eight input features.)

Feature columns

Cement, slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age.

Target column

Concrete compressive strength (MPa, megapascals)

Train set 80%
Test set 20%

Training rows teach the model. Test rows stay unseen until evaluation.

1. Import libraries

Start with pandas, plotting libraries, and the scikit-learn tools used throughout the workflow.
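A typical import cell for this workflow might look like the following. The exact set in the notebook may differ; this sketch covers the tools referenced throughout this page:

```python
# Core data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn pieces used across the workflow
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```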

2. Load and inspect the dataset

These cells load the CSV, preview rows, and inspect summary statistics and correlations.

3. Split features and target

This is where the workflow creates the training and testing partitions.

4. Optional feature scaling

Scaling is shown after the split so the notebook avoids data leakage.
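The leak-free pattern is to fit the scaler on the training rows only and reuse its statistics on the test rows. A minimal sketch with synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and target.
rng = np.random.default_rng(0)
X = rng.uniform(100, 600, size=(100, 8))   # ingredient amounts
y = rng.uniform(2, 82, size=100)           # strengths in MPa

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics

print(X_train_scaled.mean(axis=0).round(4))     # ~0 for every column
```

Fitting the scaler on the full dataset would let test-set statistics leak into training, which is exactly what this ordering avoids.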

Model training

See what training means in machine learning

Training means the model looks at many known examples and adjusts its internal parameters so its predictions get closer to the real strength values. The notebook first fits a simple baseline, then a stronger tree-based model, and optionally a tuned version.

Linear Regression

A baseline model that learns a weighted linear relationship between the inputs and strength.

  • Simple and interpretable
  • Good benchmark model
  • May miss nonlinear behavior

Random Forest

An ensemble of decision trees that usually performs better on structured data with nonlinear patterns.

  • Captures interactions between variables
  • Provides feature importance
  • Strong fit for tabular data

XGBoost

An optional boosted-tree model that often performs very well, but may require extra installation and tuning.

  • Optional in the notebook
  • Often improves accuracy
  • Adds complexity

What gets trained in each model?

  • Linear Regression: learns one coefficient for each input feature and an intercept.
  • Random Forest: learns many decision trees and combines their outputs.
  • XGBoost: learns boosted trees stage by stage to reduce error.
  • In all cases, training uses the training set where both the inputs and the true strength values are known.

1

Read training rows

The model sees many concrete recipes and their known measured strength.

2

Learn patterns

It adjusts coefficients or tree rules to better connect inputs to strength.

3

Reduce error

The model keeps improving until predictions are closer to the known target values.

4

Keep the fitted model

After training, we keep the learned model and use it later for evaluation and prediction.

What we have after training

A fitted model object, prediction errors on the test set, and plots or tables that help compare models.

What we do not have

We do not have a perfect formula for every possible mix. The model only learns from the patterns present in the training data.

Why testing matters

Testing on unseen mixes helps us check whether the learned model generalizes instead of only memorizing the training set.

Training code

5. Train baseline model

The notebook starts with linear regression so every later result can be compared against a simple baseline.

6. Train advanced model

This block switches to the stronger tree-based model selected above.
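Fitting the baseline and the tree-based model follows the same two-line pattern in scikit-learn. This sketch uses synthetic stand-in data, and the hyperparameter values are assumptions, not the notebook's exact settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real training split.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, size=(200, 8))
y_train = X_train @ rng.uniform(1, 10, size=8) + rng.normal(0, 0.5, 200)

# Baseline: one coefficient per feature plus an intercept.
lin = LinearRegression().fit(X_train, y_train)

# Stronger model: an ensemble of decision trees.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(lin.coef_.shape)      # (8,) — one weight per input feature
print(len(rf.estimators_))  # 100 fitted trees
```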

7. Optional hyperparameter tuning

RandomizedSearchCV searches over a small parameter space to improve the tree model.
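A RandomizedSearchCV call of the kind described might look like this. The parameter grid, `n_iter`, and scoring choice are illustrative assumptions, and synthetic data stands in for the training split:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real training split.
rng = np.random.default_rng(2)
X_train = rng.uniform(0, 1, size=(120, 8))
y_train = X_train.sum(axis=1) + rng.normal(0, 0.1, 120)

# A small, assumed parameter space for the tree model.
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=5,                             # sample 5 random combinations
    cv=3,                                 # 3-fold cross-validation
    scoring="neg_mean_absolute_error",    # lower MAE is better
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```

The best estimator found by the search can then be evaluated on the held-out test set like any other model.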

Feature importance preview

This visual mimics the notebook idea that tree models can rank which variables matter most.

How to read this

  • Higher bars mean the model relied more on that variable.
  • Age and cement often matter strongly for strength prediction.
  • Feature importance is most meaningful for tree-based models.
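Extracting and ranking importances from a fitted tree model can be sketched as follows. The data here is synthetic, deliberately constructed so that two features drive the target; in the notebook the importances come from the forest fitted on the real data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["cement", "slag", "fly_ash", "water",
                 "superplasticizer", "coarse_agg", "fine_agg", "age"]

# Synthetic data where cement (col 0) and age (col 7) drive the target.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(300, 8))
y = 5 * X[:, 0] + 3 * X[:, 7] + rng.normal(0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; higher means the model relied on it more.
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:18s} {score:.3f}")
```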

Evaluation

Compare model quality with meaningful regression metrics

Regression evaluation is different from classification. Instead of accuracy, this module uses MAE, RMSE, and R2 to compare how close predictions are to real measured strength values.

Model comparison chart

What the metrics mean

  • MAE is the average absolute prediction error in MPa.
  • RMSE penalizes larger mistakes more heavily than MAE.
  • R2 measures how much variance in strength the model explains.
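Computing the three metrics is a few lines once true and predicted values are in hand. The numbers below are illustrative, not notebook results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([32.4, 51.8, 40.3, 44.3])   # measured strengths, MPa
y_pred = np.array([30.0, 55.0, 41.0, 43.0])   # illustrative predictions

mae = mean_absolute_error(y_true, y_pred)     # average |error| in MPa
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes big misses
r2 = r2_score(y_true, y_pred)                 # share of variance explained

print(f"MAE={mae:.2f} MPa, RMSE={rmse:.2f} MPa, R2={r2:.3f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions; a large gap between the two signals a few big misses.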

How to interpret the chart

  • Lower MAE and RMSE are better.
  • Higher R2 is better.
  • A better model is not always the most complex one, but it often captures more nonlinear patterns.

8. Evaluate the models

These cells calculate MAE, RMSE, and R2 on the held-out test set.

9. Summarize the results

The notebook compares all trained models in one compact results table.

Testing

Test the workflow on a new concrete mix

Prediction happens after training. We enter a new concrete mix with the same input columns, and the trained model estimates its compressive strength. The browser preview below is a teaching demo, while the real notebook prediction comes from the trained scikit-learn model in Colab.

New mix design

Demo reminder

  • Demo only: the browser estimate is for understanding the workflow.
  • The real prediction in the notebook comes from the model fitted on the training data.
  • The purpose here is to show how a new mix moves into the testing stage.

Prediction preview

Choose a mix and click "Preview prediction" to see how the workflow would report model outputs.

New concrete mix → Apply same input columns → Run trained model → Get predicted MPa

10. Predict on new data

This final code block shows how the trained model is used on unseen concrete recipes.
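The key point of this step is that the new mix must arrive with the same eight input columns, in the same order, as the training data. A minimal sketch, in which a model fitted on synthetic stand-in data plays the role of the notebook's trained model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

cols = ["cement", "slag", "fly_ash", "water",
        "superplasticizer", "coarse_agg", "fine_agg", "age"]

# Synthetic stand-in training data; in the notebook, `model` is the
# regressor already fitted on the real training split.
rng = np.random.default_rng(4)
X_train = pd.DataFrame(rng.uniform(0, 1, size=(200, 8)), columns=cols)
y_train = X_train["cement"] * 50 + X_train["age"] * 30 + rng.normal(0, 1, 200)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# A new mix: same eight columns, same order, one row (values illustrative).
new_mix = pd.DataFrame([[0.5, 0.1, 0.0, 0.3, 0.02, 0.9, 0.7, 0.25]],
                       columns=cols)
predicted_mpa = model.predict(new_mix)[0]
print(f"Predicted strength: {predicted_mpa:.1f} MPa")
```

If the columns are renamed, reordered, or missing, the prediction is either an error or silently wrong, which is why the workflow emphasizes "apply same input columns" before running the trained model.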