Inferencer module

inferencer.py

Wrapper around the csvtype_ext Python bindings.

Examples

from csvtype.inferencer import Inferencer as Inf
from csvtype.patterns import alpha_patterns, int_patterns, default_na_values
import pandas as pd

df = pd.DataFrame({
    'mostly_alpha': ['a', 'b', 'c', 1, 1.0],
    'mostly_int': [0, 'a', 2, 64, 'NA']
})
df.to_csv('test.csv', sep=';', index=False)

col_type_patterns = {'int': int_patterns, 'alpha': alpha_patterns}
inf = Inf(
    filepath='test.csv',
    delimiter=';',
    col_type_patterns=col_type_patterns,
    na_values=default_na_values
)

inf.infer_types()

print(inf.col_names)
['mostly_alpha', 'mostly_int']

print(inf.num_rows)
5

print(inf.col_type_candidates(pandas=True))
       mostly_alpha  mostly_int
int             0.2         0.6
alpha           0.6         0.2
NA              0.0         0.2
other           0.2         0.0

print(inf.most_likely_col_types())
{'mostly_alpha': 'alpha', 'mostly_int': 'int'}
class csvtype.inferencer.Inferencer(*args: Any, **kwargs: Any)

A wrapper around the Python bindings of the underlying C++ code, the csvtype_ext Inf class.

filepath

The path to the CSV file to be processed

Type

str

delimiter

The delimiter of the CSV file to be processed

Type

str

col_type_patterns

A dict that maps column type names (str) to lists of regex patterns (list of str). Each key names a column type, and the patterns listed under that key are used to infer that type.

Type

dict of str, list of strings pairs, optional
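Since the patterns are plain strings, it can help to check that each one compiles before building the inferencer. The following is a minimal sketch, assuming Python's re module is a close enough stand-in for the Perl-style syntax used on the C++ side. The two dialects differ in places, so this only catches syntax slips. The helper find_invalid_patterns and the 'zipcode' and 'broken' types are hypothetical and not part of csvtype.

```python
import re
from typing import Dict, List, Tuple


def find_invalid_patterns(
    col_type_patterns: Dict[str, List[str]]
) -> List[Tuple[str, str, str]]:
    """Return (type name, pattern, error message) for every pattern that fails to compile."""
    problems = []
    for type_name, patterns in col_type_patterns.items():
        for pattern in patterns:
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append((type_name, pattern, str(exc)))
    return problems


custom_patterns = {
    "int": [r"^[-+]?\d+$"],
    "zipcode": [r"^\d{4}\s?[A-Z]{2}$"],  # hypothetical column type
    "broken": [r"^(unclosed$"],          # deliberately malformed pattern
}

problems = find_invalid_patterns(custom_patterns)
for type_name, pattern, message in problems:
    print(f"{type_name}: {pattern!r} -> {message}")
```

Only the 'broken' entry is reported here. The other two patterns compile and could be passed on as col_type_patterns.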

na_values

A set of strings that represent literal NA values. No regex matching will be used for these, but rather direct string matching.

Type

set, optional
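The difference between direct string matching and regex matching can be illustrated with plain Python set membership. This is only a sketch of the idea. Whether the extension trims whitespace or normalizes case before comparing is not stated here, so the helper is_na below is a hypothetical stand-in and not the library's implementation.

```python
# A small subset of the default NA values, used for illustration.
na_values = {"", "NA", "N/A", "1.#IND", "null"}


def is_na(field: str) -> bool:
    """Direct string matching: the field must equal one of the NA strings exactly."""
    return field in na_values


for field in ["NA", "na", " NA", "1.#IND", "1x#IND"]:
    print(repr(field), is_na(field))
```

With literal matching the dot in '1.#IND' is just a dot, so '1x#IND' is not treated as NA, and 'na' differs from 'NA'.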

multithreading

Whether to use a new thread for each field in a row in the CSV file. This might severely slow down the inferencing, depending on many factors such as the number of columns and the number of CPU cores.

Type

bool, optional

save_types_file

Whether to save the inferred types per field in a CSV file.

Type

bool, optional

types_filepath

The path to save the types file at.

Type

str, optional

rolling_cache_window

A rolling cache may speed up the inferencing process when identical values are likely to appear in consecutive rows of a column. The cache is kept for rolling_cache_window rows and then emptied. Changing this parameter might speed up or slow down the inferencing process.

Type

int, optional
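The idea behind the rolling cache can be sketched in pure Python for a single column. This is an illustration under assumptions, not the C++ implementation: the function classify_column, the toy classify function and the choice to clear the cache at every multiple of the window are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

INT_RE = re.compile(r"^[-+]?\d+$")


def classify(value: str) -> str:
    """Toy classifier standing in for the regex-based type check."""
    return "int" if INT_RE.match(value) else "other"


def classify_column(
    values: List[str],
    classify_fn: Callable[[str], str],
    rolling_cache_window: int = 5,
) -> Tuple[List[str], int]:
    """Classify the values of one column, reusing results cached within the current window."""
    cache: Dict[str, str] = {}
    results: List[str] = []
    hits = 0
    for row_index, value in enumerate(values):
        if row_index % rolling_cache_window == 0:
            cache.clear()  # window elapsed: empty the cache
        if value in cache:
            hits += 1
        else:
            cache[value] = classify_fn(value)
        results.append(cache[value])
    return results, hits


results, hits = classify_column(["1", "1", "1", "a", "a", "1", "1"], classify)
print(results, hits)
```

Repeated values inside one window skip the regex check. A column with few repeats gains little and still pays for the cache lookups, which is one way the parameter could slow things down.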

num_rows

Read-only. The number of rows in the CSV file. Available after calling infer_types().

Type

int

col_names

Read-only. The column names of the CSV file. Available after calling infer_types(). Unnamed columns are named Untitled_<index>.

Type

list of str
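The naming of unnamed columns can be pictured with the sketch below. It assumes that <index> is the zero-based position of the column in the header. The actual indexing used by the extension is not specified here, and the helper fill_column_names is hypothetical.

```python
from typing import List


def fill_column_names(header: List[str]) -> List[str]:
    """Replace empty header fields with Untitled_<index> (zero-based position assumed)."""
    return [
        name if name.strip() else f"Untitled_{index}"
        for index, name in enumerate(header)
    ]


names = fill_column_names("id;;value;".split(";"))
print(names)
```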

__init__(filepath: str = None, delimiter: str = ',', col_type_patterns: Dict[str, List[str]] = {'alpha': ['^[a-zA-Z]+$'], 'bool': ['^(true|false|yes|no|ja|nee|y|n|j|0|1|t|f|waar|onwaar)$'], 'date': ['^(\\d{1,2})(-|\\.|/)(\\d{1,2})(-|\\.|/)(\\d{2}|\\d{4})(\\s\\d{1,2}:\\d{1,2}:\\d{1,2})?(\\d{1,2}:\\d{1,2})?$', '^(\\d{1,2})/(\\d{1,2})/(\\d{2}|\\d{4})(\\s\\d{1,2}:\\d{1,2}:\\d{1,2})?(\\d{1,2}:\\d{1,2})?$', '^(\\d{2}|\\d{4})(-|\\.|/)(\\d{1,2})(-|\\.|/)(\\d{1,2})(\\s\\d{1,2}:\\d{1,2}:\\d{1,2})?(\\d{1,2}:\\d{1,2})?'], 'float': ['^([-+]?\\d*\\.\\d+)$'], 'int': ['^[-+]?\\d+$']}, na_values: Set[str] = {'', '#N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'}, multithreading: bool = False, save_types_file: bool = False, types_filepath: str = '', rolling_cache_window: int = 5) → csvtype_ext.Inf

Initializes the inferencer instance.

Parameters
  • filepath (str) – The path to the CSV file to be processed

  • delimiter (str) – The delimiter of the CSV file to be processed

  • col_type_patterns (dict of str, list of strings pairs, optional) – A dict that maps column type names (str) to lists of regex patterns (list of str). Each key names a column type, and the patterns listed under that key are PCRE-style regex patterns used to infer that type.

  • na_values (set, optional) – A set of strings that represent literal NA values. No regex matching will be used for these, but rather direct string matching.

  • multithreading (bool, optional) – Whether to use a new thread for each field in a row in the CSV file. This might severely slow down the inferencing, depending on many factors such as the number of columns and the number of CPU cores.

  • save_types_file (bool, optional) – Whether to save the inferred types per field in a CSV file.

  • types_filepath (str, optional) – The path to save the types file at.

  • rolling_cache_window (int, optional) – A rolling cache may speed up the inferencing process when identical values are likely to appear in consecutive rows of a column. The cache is kept for rolling_cache_window rows and then emptied. Changing this parameter might speed up or slow down the inferencing process.

Raises

OSError – If the given filepath does not exist.

Notes

The boost::regex implementation is used in the C++ source. It is currently not possible to pass any boost::regex flags (such as boost::regex::icase) from these Python bindings to the C++ source. This might be added in a later release.

By default, the boost::regex::perl flag is passed to the constructor, so you have to use PCRE regex patterns.
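Because flags such as icase cannot be passed, case-insensitive matching has to be written into the pattern itself. Two possible approaches are sketched below using Python's re module as a stand-in. Perl-style syntax generally allows an inline (?i) modifier, but whether it behaves the same through these bindings is an assumption that has not been confirmed here. Explicit character classes avoid relying on it.

```python
import re

# Option 1: inline modifier at the start of the pattern (assumed to be supported).
inline_flag = r"(?i)^(true|false)$"

# Option 2: spell out both cases per letter, which needs no modifier support.
explicit = r"^([Tt][Rr][Uu][Ee]|[Ff][Aa][Ll][Ss][Ee])$"

for value in ["true", "TRUE", "False", "truth"]:
    print(value, bool(re.match(inline_flag, value)), bool(re.match(explicit, value)))
```

Both patterns accept the same values in this example. The second is more verbose but depends only on basic character classes.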

Returns

An instance of the Python wrapper of the csvtype Inferencer class.

Return type

csvtype.Inferencer

col_type_candidates(pandas: bool = False) → Dict or pandas.DataFrame

Computes col type candidate ratios per column based on col type counts per column as given by the C++ inferencer.

Parameters

pandas (bool) – Whether the output should be returned as a pandas DataFrame.

Returns

A data structure that gives a ratio (between 0 and 1) for each combination of column and column type, indicating how likely it is that the column is of that type.

Return type

dict, pandas.DataFrame
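The step from counts to ratios amounts to dividing each count by the number of rows. The sketch below reproduces the numbers from the example at the top of this page. The layout of the counts dict and the helper counts_to_ratios are assumptions made for illustration, since the structure returned by the C++ inferencer is not shown here.

```python
from typing import Dict


def counts_to_ratios(
    counts: Dict[str, Dict[str, int]], num_rows: int
) -> Dict[str, Dict[str, float]]:
    """Turn per-column type counts into ratios between 0 and 1."""
    return {
        col_name: {type_name: n / num_rows for type_name, n in type_counts.items()}
        for col_name, type_counts in counts.items()
    }


# Hypothetical counts matching the five-row test.csv from the example.
counts = {
    "mostly_alpha": {"int": 1, "alpha": 3, "NA": 0, "other": 1},
    "mostly_int": {"int": 3, "alpha": 1, "NA": 1, "other": 0},
}

ratios = counts_to_ratios(counts, num_rows=5)
print(ratios["mostly_alpha"])
print(ratios["mostly_int"])
```

A nested dict of this shape converts to a DataFrame with pd.DataFrame(ratios), with the CSV columns as DataFrame columns and the types as the index, which is the layout shown in the example output.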

infer_types() → None

Infers the most likely column types based on the column type regex patterns given to the initializer.

most_likely_col_types() → Dict

Computes most likely col type for each column.

Returns

A dict with columns as keys and the most likely col types as values.

Return type

dict
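Picking the most likely type can be thought of as taking the type with the highest ratio per column. The sketch below is an assumption about that logic, not the library's code. In particular, whether 'NA' and 'other' take part in the comparison and how ties are broken are not documented here. The helper pick_most_likely is hypothetical.

```python
from typing import Dict


def pick_most_likely(ratios: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each column, return the type name with the highest ratio (first one wins on ties)."""
    return {
        col_name: max(type_ratios, key=type_ratios.get)
        for col_name, type_ratios in ratios.items()
    }


# Ratios taken from the example output at the top of this page.
example_ratios = {
    "mostly_alpha": {"int": 0.2, "alpha": 0.6, "NA": 0.0, "other": 0.2},
    "mostly_int": {"int": 0.6, "alpha": 0.2, "NA": 0.2, "other": 0.0},
}

likely = pick_most_likely(example_ratios)
print(likely)
```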

patterns.py

Regex patterns for common column types

csvtype.patterns.default_patterns

Default col_type_patterns used by Inferencer.__init__ when no col_type_patterns dict is given. Includes patterns for alpha, bool, date, float and int.

Type

dict of str, list of str pairs
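Some of the default patterns overlap, which is worth keeping in mind when reading the candidate ratios. The sketch below copies the alpha, bool, float and int patterns from the __init__ signature above (the date patterns are left out for brevity) and checks a few values with Python's re module as a stand-in for boost::regex. How the inferencer itself resolves a value that matches several types is not described here. The helper matching_types is hypothetical.

```python
import re
from typing import List

# Copied from the default col_type_patterns in the __init__ signature (date omitted).
default_subset = {
    "alpha": [r"^[a-zA-Z]+$"],
    "bool": [r"^(true|false|yes|no|ja|nee|y|n|j|0|1|t|f|waar|onwaar)$"],
    "float": [r"^([-+]?\d*\.\d+)$"],
    "int": [r"^[-+]?\d+$"],
}


def matching_types(value: str) -> List[str]:
    """Return every type name that has at least one pattern matching the value."""
    return [
        type_name
        for type_name, patterns in default_subset.items()
        if any(re.search(pattern, value) for pattern in patterns)
    ]


for value in ["1", "y", "1.5", "abc", "12"]:
    print(value, matching_types(value))
```

Values such as '1' and 'y' satisfy two patterns each, so a column of zeros and ones could plausibly score as both bool and int. Passing a narrower col_type_patterns dict, as in the example at the top of this page, avoids that ambiguity.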

csvtype.patterns.default_na_values

Default na_values used by Inferencer.__init__ when no na_values set is given. Includes the most common NA values, including those used in MS Excel.

Type

set of str