Fix Value Errors in Your Data Set Using Python

I’m trying to scale my data from a .csv file in Python to fit a range between 0 and 1, but I keep running into this frustrating “ValueError” that says my input contains NaN, infinity, or a value too large for dtype('float64'). In previous cases, I could figure out the cause, like an empty cell, blank spaces, or incompatible characters. This time, though, I can’t pinpoint what’s causing the issue, and I have too many data points to go through each one manually. Is there a quick method or trick (maybe even in Excel) to identify exactly where these problematic values are? I’d appreciate any tips or code tweaks that could help me locate NaNs or extreme values without compromising data confidentiality. Here’s the code snippet I’m working with.

Error Code:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load training data set from CSV file
training_data_df = pd.read_csv("mtth_train.csv")

# Load testing data set from CSV file
test_data_df = pd.read_csv("mtth_test.csv")

# Data needs to be scaled to a small range like 0 to 1
scaler = MinMaxScaler(feature_range= (0, 1))

# Scale both the training inputs and outputs
scaled_training = scaler.fit_transform(training_data_df)
scaled_testing = scaler.transform(test_data_df)

# Print out the adjustment that the scaler applied to the total_earnings column of data
print("Note: Parameters were scaled by multiplying by {:.10f} and adding {:.6f}".format(scaler.scale_[8], scaler.min_[8]))

# Create new pandas DataFrame objects from the scaled data
scaled_training_df = pd.DataFrame(scaled_training, columns=training_data_df.columns.values)
scaled_testing_df = pd.DataFrame(scaled_testing, columns=test_data_df.columns.values)

# Save scaled data dataframes to new CSV files
scaled_training_df.to_csv("mtth_train_scaled.csv", index=False)
scaled_testing_df.to_csv("mtth_test_scaled.csv", index=False)

To identify the problematic values in your dataset (NaN, infinity, or excessively large values), you can add a few checks to your code that detect and locate them before scaling. This way, you won’t need to manually check through your entire dataset.

Solution:

  1. Check for NaNs, infinities, and extremely large values: We can use Pandas functions to scan the dataset for these errors before scaling.
  2. Handle the problematic values: Depending on your data and requirements, you may choose to drop, fill, or transform these values.
  3. Re-run the scaling process once the data is clean.
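
Before wiring the checks into the full script, the three steps above can be sketched as one quick diagnostic pass (a minimal sketch; the `report_bad_values` helper and the tiny example frame are illustrative, not part of the original script):

```python
import numpy as np
import pandas as pd

def report_bad_values(df, name):
    # Count NaN entries per column
    nan_counts = df.isna().sum()
    # Count +/- infinity per column (numeric columns only)
    numeric = df.select_dtypes(include=[np.number])
    inf_counts = np.isinf(numeric).sum()
    print(f"{name}: {nan_counts.sum()} NaN(s), {inf_counts.sum()} infinite value(s)")
    if nan_counts.any():
        print("NaNs per column:\n", nan_counts[nan_counts > 0])
    if inf_counts.any():
        print("Infs per column:\n", inf_counts[inf_counts > 0])

# Illustrative frame: one NaN in 'a', one infinity in 'b'
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.inf, 2.0, 4.0]})
report_bad_values(df, "example")
```

Running this on your real training and test frames before scaling tells you immediately which columns need attention.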

Here’s the modified code with explanations for each step:

Correct Code:

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Load training and testing data sets from CSV files
training_data_df = pd.read_csv("mtth_train.csv")
test_data_df = pd.read_csv("mtth_test.csv")

# 1. Check for NaN values in the dataset
# Training Data
if training_data_df.isnull().values.any():
    print("Training data contains NaN values.")
    print(training_data_df[training_data_df.isnull().any(axis=1)])

# Testing Data
if test_data_df.isnull().values.any():
    print("Testing data contains NaN values.")
    print(test_data_df[test_data_df.isnull().any(axis=1)])

# 2. Check for infinity values in the dataset
# Replace infinities with NaN (optional, based on how you want to handle them)
training_data_df.replace([np.inf, -np.inf], np.nan, inplace=True)
test_data_df.replace([np.inf, -np.inf], np.nan, inplace=True)

# 3. Check for very large values
# Here we define a threshold, such as 1e10 (10 billion), which you can adjust as needed
large_value_threshold = 1e10

# Check training data for large values
large_values_training = training_data_df > large_value_threshold
if large_values_training.values.any():
    print("Training data contains very large values above the threshold.")
    print(training_data_df[large_values_training.any(axis=1)])

# Check testing data for large values
large_values_testing = test_data_df > large_value_threshold
if large_values_testing.values.any():
    print("Testing data contains very large values above the threshold.")
    print(test_data_df[large_values_testing.any(axis=1)])

# 4. Fill NaN values with 0 (or use dropna() to remove the affected rows instead)
training_data_df.fillna(0, inplace=True)
test_data_df.fillna(0, inplace=True)

# 5. Scale data to the range 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))

# Scale the training data
scaled_training = scaler.fit_transform(training_data_df)

# Scale the testing data
scaled_testing = scaler.transform(test_data_df)

# Print out the adjustment that the scaler applied (for debugging if needed)
print("Note: Parameters were scaled by multiplying by {:.10f} and adding {:.6f}".format(scaler.scale_[0], scaler.min_[0]))

# Create new pandas DataFrame objects from the scaled data
scaled_training_df = pd.DataFrame(scaled_training, columns=training_data_df.columns.values)
scaled_testing_df = pd.DataFrame(scaled_testing, columns=test_data_df.columns.values)

# Save scaled data to new CSV files
scaled_training_df.to_csv("mtth_train_scaled.csv", index=False)
scaled_testing_df.to_csv("mtth_test_scaled.csv", index=False)

Explanation of Key Steps:

  1. NaN Check:
    • We use .isnull().values.any() to identify if there are any NaNs in the dataset. If NaNs are found, we display the rows containing them.
  2. Infinity Check:
    • We replace any infinity values with NaNs using .replace([np.inf, -np.inf], np.nan, inplace=True), as infinities could interfere with scaling.
  3. Large Value Check:
    • We define a threshold (e.g., 1e10) for identifying excessively large values. We then print any rows that exceed this threshold for further examination.
  4. Fill or Remove NaNs:
    • After identifying NaNs and infinities, we fill them with 0 (or handle them according to your preference). Note that filling with 0 can distort a column’s scaled range if 0 lies outside its natural values, so a column mean or median may be a better fill.
  5. Scaling the Data:
    • Once the data is cleaned, we apply the MinMaxScaler to scale the values to the range of 0 to 1 as intended.
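
To answer the original question of pinpointing exactly where the problematic values sit, the row and column coordinates can be listed directly (a sketch; `locate_bad_cells` is a hypothetical helper name, and it inspects numeric columns only):

```python
import numpy as np
import pandas as pd

def locate_bad_cells(df):
    """Return (row_index, column_name) pairs for every NaN or infinite cell."""
    numeric = df.select_dtypes(include=[np.number])
    mask = ~np.isfinite(numeric)          # True where NaN, +inf, or -inf
    stacked = mask.stack()                # one boolean per (row, column) cell
    return list(stacked[stacked].index)   # keep only the True positions

# Illustrative frame with one NaN and one infinity
df = pd.DataFrame({"x": [1.0, np.nan], "y": [np.inf, 4.0]})
print(locate_bad_cells(df))  # prints the (row, column) positions of the two bad cells
```

This gives you the exact coordinates to inspect, which is usually faster than scanning whole rows.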

By following these steps, you should be able to identify, handle, and scale your data without encountering the “ValueError: Input contains NaN, infinity or a value too large” issue.
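
Since blank spaces and placeholder strings were past culprits, another option is to catch them at load time: pandas’ read_csv accepts an na_values list so such entries arrive as NaN (a sketch; the placeholder strings shown are assumptions about what your file might contain):

```python
import io
import pandas as pd

# Inline CSV standing in for the real file: 'b' has a blank-space cell and an "N/A" cell
csv_text = "a,b\n1, \n2,N/A\n3,4\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=[" ", "N/A", "?"],   # strings to treat as missing
    skipinitialspace=True,         # strip spaces that follow delimiters
)
print(df["b"].isna().sum())  # the blank and "N/A" cells become NaN
```

Loading this way means the NaN checks above catch every problem cell in one place, instead of incompatible characters slipping through as object-dtype columns.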
