
An LLM-Based Workflow for Automated Tabular Data Validation 

This article is part of a series on automating data cleaning for any tabular dataset.

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

What is Data Validity?

Data validity refers to the conformity of data to expected formats, types, and value ranges. This standardisation within a single column ensures that values are uniform and meet implicit or explicit requirements.

Common issues related to data validity include:

  • Inappropriate variable types: Column data types that are not suited to analytical needs, e.g., temperature values in text format.
  • Columns with mixed data types: A single column containing both numerical and textual data.
  • Non-conformity to expected formats: For instance, invalid email addresses or URLs.
  • Out-of-range values: Column values that fall outside what is allowed or considered normal, e.g., negative age values or ages greater than 30 for high school students.
  • Time zone and DateTime format issues: Inconsistent or heterogeneous date formats within the dataset.
  • Lack of measurement standardisation or uniform scale: Variability in the units of measurement used for the same variable, e.g., mixing Celsius and Fahrenheit values for temperature.
  • Special characters or whitespace in numeric fields: Numeric data contaminated by non-numeric elements.

And the list goes on.

Error types such as duplicated records or entities and missing values do not fall into this category.

But what is the typical strategy for identifying such data validity issues?

When data meets expectations

Data cleaning, while it can be very complex, can generally be broken down into two key phases:

1. Detecting data errors.

2. Correcting these errors.

At its core, data cleaning revolves around identifying and resolving discrepancies in datasets—specifically, values that violate predefined constraints derived from expectations about the data.

It’s important to acknowledge a fundamental fact: in real-world scenarios, it’s almost impossible to be exhaustive in identifying all potential data errors—the sources of data issues are virtually infinite, ranging from human input mistakes to system failures, and thus impossible to predict entirely. What we can do, however, is define the patterns we reasonably expect in our data, known as data expectations: assumptions about what “correct” data should look like. For example:

  • If working with a dataset of high school students, we might expect ages to fall between 14 and 18 years old.
  • A customer database might require email addresses to follow a standard format (e.g., user@domain.com).

By establishing these expectations, we create a structured framework for detecting anomalies, making the data cleaning process both manageable and scalable.

These expectations are derived from both semantic and statistical analysis. We understand that the column name “age” refers to the well-known concept of time spent living. Other column names may be drawn from the lexical field of high school, and column statistics (e.g. minimum, maximum, mean, etc.) offer insights into the distribution and range of values. Taken together, this information helps determine our expectations for that column:

  • Age values should be integers
  • Values should fall between 14 and 18

Expectations are only as accurate as the time spent analysing the dataset. Naturally, if a team uses a dataset daily, the likelihood of discovering subtle data issues — and therefore of refining expectations — increases significantly. That said, even simple expectations are rarely checked systematically in most environments, often due to time constraints or simply because it’s not the most enjoyable or high-priority task on the to-do list.

Once we’ve defined our expectations, the next step is to check whether the data actually meets them. This means applying data constraints and looking for violations. For each expectation, one or more constraints can be defined. These Data Quality rules can be translated into programmatic functions that return a binary decision — a Boolean value indicating whether a given value violates the tested constraint.
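As a minimal sketch of what such constraint functions could look like in Python, reusing the age and email expectations above; here a function returns True when the value satisfies the constraint, and the names and regular expression are illustrative rather than taken from the article:

```python
import re


def is_valid_age(value) -> bool:
    """Age expectation for high school students: an integer between 14 and 18."""
    return isinstance(value, int) and 14 <= value <= 18


def is_valid_email(value) -> bool:
    """Email expectation: roughly matches the user@domain.tld pattern."""
    return bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(value)))


print(is_valid_age(16))                   # True
print(is_valid_age(31))                   # False: out of the expected range
print(is_valid_email("user@domain.com"))  # True
```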

This strategy is commonly implemented in many data quality management tools, which offer ways to detect all data errors in a dataset based on the defined constraints. An iterative process then begins to address each issue until all expectations are satisfied — i.e. no violations remain.

This strategy may seem straightforward and easy to implement in theory. However, that’s often not what we see in practice — data quality remains a major challenge and a time-consuming task in many organisations.

An LLM-based workflow to generate data expectations, detect violations, and resolve them

This validation workflow is split into two main components: validating column data types and checking compliance with expectations.

One might handle both simultaneously, but in our experiments, properly converting each column’s values in a data frame beforehand is a crucial preliminary step. It facilitates data cleaning by breaking the entire process down into a series of sequential actions, which improves performance, comprehension, and maintainability. This strategy is, of course, somewhat subjective, but it avoids tackling every data quality issue at once wherever possible.

To illustrate and understand each step of the whole process, we’ll consider this generated example:

Examples of data validity issues are spread across the table. Each row intentionally embeds one or more issues:

  • Row 1: Uses a non‑standard date format and an invalid URL scheme (non‑conformity to expected formats).
  • Row 2: Contains a price value as text (“twenty”) instead of a numeric value (inappropriate variable type).
  • Row 3: Has a rating given as “4 stars” mixed with numeric ratings elsewhere (mixed data types).
  • Row 4: Provides a rating value of “10”, which is out‑of‑range if ratings are expected to be between 1 and 5 (out‑of‑range value). Additionally, there is a typo in the word “Food”.
  • Row 5: Uses a price with a currency symbol (“20€”) and a rating with extra whitespace (“5 ”), showing a lack of measurement standardisation and special characters/whitespace issues.

Validate Column Data Types

Estimate column data types

The task here is to determine the most appropriate data type for each column in a data frame, based on the column’s semantic meaning and statistical properties. The classification is limited to the following options: string, int, float, datetime, and boolean. These categories are generic enough to cover most data types commonly encountered.

There are multiple ways to perform this classification, including deterministic approaches. The method chosen here leverages a large language model (LLM), prompted with information about each column and the overall data frame context to guide its decision (a sketch of how this context can be assembled follows the list below):

  • The list of column names
  • Representative rows from the dataset, randomly sampled
  • Column statistics describing each column (e.g. number of unique values, proportion of top values, etc.)
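As an illustration, here is a hedged Pandas sketch of how that context could be gathered before prompting the model; the exact prompt and statistics used by the service are not described in the article, so the fields below are assumptions:

```python
import pandas as pd


def build_column_context(df: pd.DataFrame, n_samples: int = 5) -> dict:
    """Collect column names, randomly sampled rows, and per-column statistics."""
    stats = {}
    for col in df.columns:
        series = df[col]
        stats[col] = {
            "n_unique": int(series.nunique(dropna=True)),
            # proportion of the three most frequent values
            "top_values": series.value_counts(normalize=True).head(3).round(2).to_dict(),
        }
    return {
        "column_names": list(df.columns),
        "sample_rows": df.sample(min(n_samples, len(df)), random_state=0).to_dict(orient="records"),
        "column_statistics": stats,
    }
```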

Example:

1. Column Name: date 
  Description: Represents the date and time information associated with each record. 
  Suggested Data Type: datetime

2. Column Name: category 
  Description: Contains the categorical label defining the type or classification of the item. 
  Suggested Data Type: string

3. Column Name: price 
  Description: Holds the numeric price value of an item expressed in monetary terms. 
  Suggested Data Type: float

4. Column Name: image_url 
  Description: Stores the web address (URL) pointing to the image of the item. 
  Suggested Data Type: string

5. Column Name: rating 
  Description: Represents the evaluation or rating of an item using a numeric score. 
  Suggested Data Type: int

Convert Column Values into the Estimated Data Type

Once the data type of each column has been predicted, the conversion of values can begin. Depending on the table framework used, this step might differ slightly, but the underlying logic remains similar. For instance, in the CleanMyExcel.io service, Pandas is used as the core data frame engine. However, other libraries like Polars or PySpark are equally capable within the Python ecosystem.
All non-convertible values are set aside for further investigation.
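A minimal Pandas sketch of this step, assuming the estimated type is passed as a string; the helper below is illustrative, not the service’s actual implementation:

```python
import pandas as pd


def convert_column(df: pd.DataFrame, column: str, target_type: str):
    """Convert a column to its estimated type and return both the converted
    series and the original values that could not be converted."""
    if target_type in ("int", "float"):
        converted = pd.to_numeric(df[column], errors="coerce")
    elif target_type == "datetime":
        converted = pd.to_datetime(df[column], errors="coerce", utc=True)
    else:
        converted = df[column].astype("string")

    # Values that were present originally but are null after conversion failed to convert
    failed = converted.isna() & df[column].notna()
    return converted, df.loc[failed, column]
```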

Analyse Non-convertible Values and Propose Substitutes

This step can be viewed as an imputation task. The previously flagged non-convertible values violate the column’s expected data type. Because the potential causes are so diverse, this step can be quite challenging. Once again, an LLM offers a helpful trade-off to interpret the conversion errors and suggest possible replacements.
Sometimes, the correction is straightforward—for example, converting an age value of twenty into the integer 20. In many other cases, a substitute is not so obvious, and tagging the value with a sentinel (placeholder) value is a better choice. In Pandas, for instance, the special object pd.NA is suitable for such cases.

Example:

{
  "violations": [
    {
      "index": 2,
      "column_name": "rating",
      "value": "4 stars",
      "violation": "Contains non-numeric text in a numeric rating field.",
      "substitute": "4"
    },
    {
      "index": 1,
      "column_name": "price",
      "value": "twenty",
      "violation": "Textual representation that cannot be directly converted to a number.",
      "substitute": "20"
    },
    {
      "index": 4,
      "column_name": "price",
      "value": "20€",
      "violation": "Price value contains an extraneous currency symbol.",
      "substitute": "20"
    }
  ]
}

Replace Non-convertible Values

At this point, a programmatic function is applied to replace the problematic values with the proposed substitutes. The column is then tested again to ensure all values can now be converted into the estimated data type. If successful, the workflow proceeds to the expectations module. Otherwise, the previous steps are repeated until the column is validated.
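A sketch of what this replacement function might look like in Pandas, assuming `violations` holds the list produced in the previous step (as in the JSON above):

```python
import pandas as pd


def apply_substitutes(df: pd.DataFrame, violations: list[dict]) -> pd.DataFrame:
    """Replace each flagged value with its proposed substitute, or a sentinel if none."""
    out = df.copy()
    for v in violations:
        substitute = v.get("substitute")
        out.loc[v["index"], v["column_name"]] = substitute if substitute is not None else pd.NA
    return out
```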

Validate Column Data Expectations

Generate Expectations for All Columns

The following elements are provided:

  • Data dictionary: column name, a short description, and the expected data type
  • Representative rows from the dataset, randomly sampled
  • Column statistics, such as number of unique values and proportion of top values

Based on each column’s semantic meaning and statistical properties, the goal is to define validation rules and expectations that ensure data quality and integrity. These expectations should fall into one of the following categories related to standardisation:

  • Valid ranges or intervals
  • Expected formats (e.g. for emails or phone numbers)
  • Allowed values (e.g. for categorical fields)
  • Column data standardisation (e.g. ‘Mr’, ‘Mister’, ‘Mrs’, ‘Mrs.’ becomes [‘Mr’, ‘Mrs’])

Example:

Column name: date

• Expectation: Value must be a valid datetime.
 - Reasoning: The column represents date and time information so each entry should follow a standard datetime format (for example, ISO 8601). 
• Expectation: Datetime values should include timezone information (preferably UTC).
 - Reasoning: The provided sample timestamps include explicit UTC timezone information. This ensures consistency in time-based analyses.

──────────────────────────────
Column name: category

• Expectation: Allowed values should be standardized to a predefined set.
 - Reasoning: Based on the semantic meaning, valid categories might include “Books”, “Electronics”, “Food”, “Clothing”, and “Furniture”. (Note: The sample includes “Fod”, which likely needs correcting to “Food”.) 
• Expectation: Entries should follow a standardized textual format (e.g., Title Case).
 - Reasoning: Consistent capitalization and spelling will improve downstream analyses and reduce data cleaning issues.

──────────────────────────────
Column name: price

• Expectation: Value must be a numeric float.
 - Reasoning: Since the column holds monetary amounts, entries should be stored as numeric values (floats) for accurate calculations.
• Expectation: Price values should fall within a valid non-negative numeric interval (e.g., price ≥ 0).
 - Reasoning: Negative prices generally do not make sense in a pricing context. Even if the minimum observed value in the sample is 9.99, allowing zero or positive values is more realistic for pricing data.

──────────────────────────────
Column name: image_url

• Expectation: Value must be a valid URL with the expected format.
 - Reasoning: Since the column stores image web addresses, each URL should adhere to standard URL formatting patterns (e.g., including a proper protocol schema).
• Expectation: The URL should start with “https://”.
 - Reasoning: The sample shows that one URL uses “htp://”, which is likely a typo. Enforcing a secure (https) URL standard improves data reliability and user security.

──────────────────────────────
Column name: rating

• Expectation: Value must be an integer.
 - Reasoning: The evaluation score is numeric, and, as seen in the sample, the rating is stored as an integer.
• Expectation: Rating values should fall within a valid interval, such as between 1 and 5.
 - Reasoning: In many contexts, ratings are typically on a scale of 1 to 5. Although the sample includes a value of 10, it is likely a data quality issue. Enforcing this range standardizes the evaluation scale.

Generate Validation Code

Once expectations have been defined, the goal is to generate structured validation code that checks the data against these constraints. The code format may vary depending on the chosen validation library, such as Pandera (used in CleanMyExcel.io), Pydantic, Great Expectations, Soda, etc.

To make debugging easier, the validation code should apply checks elementwise so that when a failure occurs, the row index and column name are clearly identified. This helps to pinpoint and resolve issues effectively.
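As a hedged illustration, a Pandera schema for the example columns might look like the sketch below, assuming `df` is the data frame already converted in the previous module; the allowed categories, ranges, and URL rule mirror the expectations above, and this is not the service’s actual generated code. With `lazy=True`, every failure is collected along with its row index and column name:

```python
import pandera as pa

# Illustrative schema for the example columns; a real schema would be generated from the expectations.
schema = pa.DataFrameSchema({
    "category": pa.Column(str, pa.Check.isin(["Books", "Electronics", "Food", "Clothing", "Furniture"])),
    "price": pa.Column(float, pa.Check.ge(0)),
    "image_url": pa.Column(str, pa.Check.str_startswith("https://")),
    "rating": pa.Column(int, pa.Check.in_range(1, 5)),
})

try:
    schema.validate(df, lazy=True)  # lazy=True collects all violations instead of stopping at the first
except pa.errors.SchemaErrors as err:
    # failure_cases pinpoints each violation by column, check, failing value, and row index
    print(err.failure_cases[["column", "check", "failure_case", "index"]])
```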

Analyse Violations and Propose Substitutes

When a violation is detected, it must be resolved. Each issue is flagged with a short explanation and a precise location (row index + column name). An LLM is used to estimate the best possible replacement value based on the violation’s description. Again, this proves useful due to the variety and unpredictability of data issues. If the appropriate substitute is unclear, a sentinel value is applied, depending on the data frame package in use.
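A sketch of how the flagged violations could be handed to the model; `call_llm` is a hypothetical helper standing in for whichever LLM client is used, and the expected response resembles the JSON example below:

```python
import json
import pandas as pd


def build_violation_prompt(failure_cases: pd.DataFrame) -> str:
    """Format validation failure cases into a prompt asking for substitute values."""
    records = failure_cases[["index", "column", "failure_case", "check"]].to_dict(orient="records")
    return (
        "For each violation below, propose the best substitute value, "
        "or null if no sensible replacement exists. Respond as JSON with keys "
        "index, column_name, value, violation, and substitute.\n"
        + json.dumps(records, default=str, indent=2)
    )


# suggestions = call_llm(build_violation_prompt(err.failure_cases))  # hypothetical LLM call
```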

Example:

{
  "violations": [
    {
      "index": 3,
      "column_name": "category",
      "value": "Fod",
      "violation": "category should be one of ['Books', 'Electronics', 'Food', 'Clothing', 'Furniture']",
      "substitute": "Food"
    },
    {
      "index": 0,
      "column_name": "image_url",
      "value": "htp://imageexample.com/pic.jpg",
      "violation": "image_url should start with 'https://'",
      "substitute": "https://imageexample.com/pic.jpg"
    },
    {
      "index": 3,
      "column_name": "rating",
      "value": "10",
      "violation": "rating should be between 1 and 5",
      "substitute": "5"
    }
  ]
}

The remaining steps are similar to the iteration process used during the validation of column data types. Once all violations are resolved and no further issues are detected, the data frame is fully validated.

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Conclusion

Expectations may sometimes lack domain expertise — integrating human input can help surface more diverse, specific, and reliable expectations.

A key challenge lies in automation during the resolution process. A human-in-the-loop approach could introduce more transparency, particularly in the selection of substitute or imputed values.

This article is part of a series on automating data cleaning for any tabular dataset.

In upcoming articles, we’ll explore related topics already on the roadmap, including:

  • A detailed description of the spreadsheet encoder used in the article above.
  • Data uniqueness: preventing duplicate entities within the dataset.
  • Data completeness: handling missing values effectively.
  • Evaluating data reshaping, validity, and other key aspects of data quality.

Stay tuned!

Thank you to Marc Hobballah for reviewing this article and providing feedback.

All images, unless otherwise noted, are by the author.


In Business

In the business section, Eve showcases our extensive offerings, including our cutting-edge compliance tool. She provides examples of its functionality, helping organizations understand how it can streamline compliance processes and improve overall efficiency. Eve also explores our cybersecurity solutions powered by AI, demonstrating how these technologies can protect organizations from potential threats while ensuring data integrity and security. While Eve is tailored for internal purposes, she represents only a fraction of the incredible capabilities that AI can provide. With Eve, you gain access to an intelligent assistant that enhances training, compliance, and operational capabilities, making the journey towards AI implementation more accessible. At Enlightening Methodology, we are committed to innovation and continuous improvement. Join us on this exciting journey as we leverage Eve's abilities to drive progress in both healthcare and business, paving the way for a smarter and more efficient future. With Eve by your side, you're not just engaging with AI; you're witnessing the growth potential of technology that is reshaping training, compliance and our world! Welcome to Enlightening Methodology, where innovation meets opportunity!