Inferencer module
inferencer.py
Wrapper around the csvtype_ext Python bindings.
Examples
from csvtype.inferencer import Inferencer as Inf
from csvtype.patterns import alpha_patterns, int_patterns, default_na_values
import pandas as pd
df = pd.DataFrame({
    'mostly_alpha': ['a', 'b', 'c', 1, 1.0],
    'mostly_int': [0, 'a', 2, 64, 'NA']
})
df.to_csv('test.csv', sep=';', index=False)
col_type_patterns = {'int': int_patterns, 'alpha': alpha_patterns}
inf = Inf(
    filepath='test.csv',
    delimiter=';',
    col_type_patterns=col_type_patterns,
    na_values=default_na_values
)
inf.infer_types()
print(inf.col_names)
['mostly_alpha', 'mostly_int']
print(inf.num_rows)
5
print(inf.col_type_candidates(pandas=True))
       mostly_alpha  mostly_int
int             0.2         0.6
alpha           0.6         0.2
NA              0.0         0.2
other           0.2         0.0
print(inf.most_likely_col_types())
{'mostly_alpha': 'alpha', 'mostly_int': 'int'}
-
class csvtype.inferencer.Inferencer(*args: Any, **kwargs: Any) A wrapper around the Python bindings of the underlying C++ code, the csvtype_ext Inf class.
-
filepath The path to the CSV file to be processed
- Type
str
-
delimiter The delimiter of the CSV file to be processed
- Type
str
-
col_type_patterns A dict that maps column type names (str) to lists of regex patterns (list of str). The patterns listed under a key are used to infer that column type.
- Type
dict of str, list of strings pairs, optional
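A minimal sketch of what such a dict might look like. The 'postcode' type and its pattern are hypothetical, added only for illustration, and the helper matching_types is not part of csvtype. Python's re module is used here as a quick sanity check; it is not the boost::regex engine used by the C++ source, so a passing check is indicative rather than conclusive.

```python
import re
from typing import Dict, List

# Column type name -> list of regex patterns for that type.
# 'int' reuses the default pattern shown in the __init__ signature;
# 'postcode' is a hypothetical type added for illustration.
col_type_patterns: Dict[str, List[str]] = {
    'int': [r'^[-+]?\d+$'],
    'postcode': [r'^\d{4}\s?[A-Z]{2}$'],
}


def matching_types(value: str, patterns: Dict[str, List[str]]) -> List[str]:
    """Return the names of all column types with a pattern matching value."""
    return [
        name
        for name, regexes in patterns.items()
        if any(re.search(regex, value) for regex in regexes)
    ]


print(matching_types('1234 AB', col_type_patterns))  # ['postcode']
print(matching_types('-42', col_type_patterns))      # ['int']
print(matching_types('hello', col_type_patterns))    # []
```

Checking a handful of representative field values this way before running the inferencer on a large file can catch a pattern that is too strict or too loose.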
-
na_values A set of strings that represent literal NA values. No regex matching will be used for these, but rather direct string matching.
- Type
set, optional
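The difference between literal matching and regex matching can be sketched as a plain set membership test. The set below is copied from the default in the __init__ signature; the helper is_na is hypothetical, and the sketch assumes fields are compared without trimming whitespace or folding case, which this documentation does not state either way.

```python
# Default NA values, copied from the __init__ signature.
na_values = {
    '', '#N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND',
    '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null',
}


def is_na(field: str) -> bool:
    """Exact, case-sensitive membership test; no regex is involved."""
    return field in na_values


print(is_na('NA'))    # True
print(is_na('na'))    # False: lower-case 'na' is not in the set
print(is_na(' NA'))   # False: the leading space is part of the string
print(is_na('N/A.'))  # False: no partial or pattern matching
```

To treat more spellings as NA, a larger set can be passed, for example na_values | {'na', 'missing'}.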
-
multithreading Whether to use a new thread for each field in a row in the CSV file. This might severely slow down the inferencing, depending on many factors such as the number of columns and the number of CPU cores.
- Type
bool, optional
-
save_types_file Whether to save the inferred types per field in a CSV file.
- Type
bool, optional
-
types_filepath The path to save the types file at.
- Type
str, optional
-
rolling_cache_window A rolling cache is used, which may speed up inference when identical values are likely to occur in consecutive rows of a column. The cache is kept for rolling_cache_window rows and then emptied. Changing this parameter may speed up or slow down inference.
- Type
int, optional
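The idea behind a rolling cache can be sketched in pure Python. This is an illustration of the general technique under stated assumptions, not the C++ implementation: the names classify and infer_column are hypothetical, and the sketch assumes one cache per column that is cleared at the start of every window of rolling_cache_window rows.

```python
import re
from typing import Dict, List, Tuple

# Two of the default patterns from the __init__ signature.
PATTERNS: Dict[str, List[str]] = {
    'int': [r'^[-+]?\d+$'],
    'alpha': [r'^[a-zA-Z]+$'],
}


def classify(value: str) -> str:
    """Slow path: test the value against every pattern until one matches."""
    for type_name, regexes in PATTERNS.items():
        if any(re.search(regex, value) for regex in regexes):
            return type_name
    return 'other'


def infer_column(
    values: List[str], rolling_cache_window: int = 5
) -> Tuple[List[str], int]:
    """Return the inferred types and the number of slow-path classifications."""
    cache: Dict[str, str] = {}
    types: List[str] = []
    regex_runs = 0
    for row_index, value in enumerate(values):
        if row_index % rolling_cache_window == 0:
            cache.clear()  # empty the cache at the start of every window
        if value not in cache:
            cache[value] = classify(value)
            regex_runs += 1
        types.append(cache[value])
    return types, regex_runs


types, regex_runs = infer_column(['7', '7', '7', 'x', '7', '7'])
print(types)       # ['int', 'int', 'int', 'alpha', 'int', 'int']
print(regex_runs)  # 3 instead of 6
```

With many repeated values a larger window saves more regex evaluations; with mostly unique values the cache adds lookups without saving work, which is one way the parameter can cut both ways.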
-
num_rows Read-only. The number of rows in the CSV file. Available after calling infer_types().
- Type
int
-
col_names Read-only. The column names of the CSV file. Available after calling infer_types(). Unnamed columns are named Untitled_<index>.
- Type
list of str
-
__init__(filepath: str = None, delimiter: str = ',', col_type_patterns: Dict[str, List[str]] = {'alpha': ['^[a-zA-Z]+$'], 'bool': ['^(true|false|yes|no|ja|nee|y|n|j|0|1|t|f|waar|onwaar)$'], 'date': ['^(\\d{1,2})(-|\\.|/)(\\d{1,2})(-|\\.|/)(\\d{2}|\\d{4})(\\s\\d{1,2}:\\d{1,2}:\\d{1,2})?(\\d{1,2}:\\d{1,2})?$', '^(\\d{1,2})/(\\d{1,2})/(\\d{2}|\\d{4})(\\s\\d{1,2}:\\d{1,2}:\\d{1,2})?(\\d{1,2}:\\d{1,2})?$', '^(\\d{2}|\\d{4})(-|\\.|/)(\\d{1,2})(-|\\.|/)(\\d{1,2})(\\s\\d{1,2}:\\d{1,2}:\\d{1,2})?(\\d{1,2}:\\d{1,2})?'], 'float': ['^([-+]?\\d*\\.\\d+)$'], 'int': ['^[-+]?\\d+$']}, na_values: Set[str] = {'', '#N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'}, multithreading: bool = False, save_types_file: bool = False, types_filepath: str = '', rolling_cache_window: int = 5) → csvtype_ext.Inf Initializes the inferencer instance.
- Parameters
filepath (str) – The path to the CSV file to be processed
delimiter (str) – The delimiter of the CSV file to be processed
col_type_patterns (dict of str, list of strings pairs, optional) – A dict that maps column type names (str) to lists of PCRE-style regex patterns (list of str). The patterns listed under a key are used to infer that column type.
na_values (set, optional) – A set of strings that represent literal NA values. No regex matching will be used for these, but rather direct string matching.
multithreading (bool, optional) – Whether to use a new thread for each field in a row in the CSV file. This might severely slow down the inferencing, depending on many factors such as the number of columns and the number of CPU cores.
save_types_file (bool, optional) – Whether to save the inferred types per field in a CSV file.
types_filepath (str, optional) – The path to save the types file at.
rolling_cache_window (int, optional) – A rolling cache is used, which may speed up inference when identical values are likely to occur in consecutive rows of a column. The cache is kept for rolling_cache_window rows and then emptied. Changing this parameter may speed up or slow down inference.
- Raises
OSError – If the given filepath does not exist.
Notes
The C++ source uses the boost::regex implementation. It is currently not possible to pass boost::regex flags (such as boost::regex::icase) from these Python bindings to the C++ source. This might be added in a later release.
By default, the boost::regex::perl flag is passed to the constructor, so patterns must be written in Perl-compatible (PCRE-style) syntax.
- Returns
An instance of the Python wrapper of the csvtype Inferencer class.
- Return type
csvtype.Inferencer
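Because flags such as boost::regex::icase cannot be passed, one possible workaround is to spell out both letter cases inside the pattern itself. The sketch below builds such a pattern from a word list; the helper any_case is hypothetical and not part of csvtype. It is checked here with Python's re module only. Character classes are basic regex syntax, so the resulting pattern is expected to behave the same under Perl-style syntax, but that has not been verified against the bindings.

```python
import re


def any_case(word: str) -> str:
    """Turn a literal word into a regex fragment that matches any letter case."""
    return ''.join(
        f'[{ch.lower()}{ch.upper()}]' if ch.isalpha() else re.escape(ch)
        for ch in word
    )


words = ['true', 'false', 'yes', 'no']
bool_pattern = '^(' + '|'.join(any_case(word) for word in words) + ')$'

print(bool_pattern)
# ^([tT][rR][uU][eE]|[fF][aA][lL][sS][eE]|[yY][eE][sS]|[nN][oO])$
print(bool(re.search(bool_pattern, 'TRUE')))   # True
print(bool(re.search(bool_pattern, 'False')))  # True
print(bool(re.search(bool_pattern, 'maybe')))  # False
```

The generated string can then be used as an entry in a col_type_patterns list, for example {'bool': [bool_pattern]}.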
-
col_type_candidates(pandas: bool = False) → [typing.Dict, pandas.core.frame.DataFrame] Computes column type candidate ratios per column, based on the column type counts per column reported by the C++ inferencer.
- Parameters
pandas (bool) – Whether the output should be returned as a pandas DataFrame.
- Returns
A data structure that gives, for each combination of column and column type, a ratio between 0 and 1 indicating how likely it is that the column is of that type.
- Return type
dict or pandas.DataFrame
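The relationship between counts and ratios can be sketched as a division by the number of rows. The counts below are reconstructed from the example at the top of this page (5 rows); the nested dict layout and the helper to_ratios are assumptions made for illustration and may differ from what the C++ inferencer actually returns.

```python
from typing import Dict

# Hypothetical per-column type counts, consistent with the 5-row example above.
num_rows = 5
col_type_counts: Dict[str, Dict[str, int]] = {
    'mostly_alpha': {'int': 1, 'alpha': 3, 'NA': 0, 'other': 1},
    'mostly_int': {'int': 3, 'alpha': 1, 'NA': 1, 'other': 0},
}


def to_ratios(
    counts: Dict[str, Dict[str, int]], num_rows: int
) -> Dict[str, Dict[str, float]]:
    """Divide every count by the number of rows to get a ratio in [0, 1]."""
    return {
        col: {type_name: n / num_rows for type_name, n in type_counts.items()}
        for col, type_counts in counts.items()
    }


ratios = to_ratios(col_type_counts, num_rows)
print(ratios['mostly_alpha'])
# {'int': 0.2, 'alpha': 0.6, 'NA': 0.0, 'other': 0.2}
print(ratios['mostly_int'])
# {'int': 0.6, 'alpha': 0.2, 'NA': 0.2, 'other': 0.0}
```

Passing a nested dict shaped like this to pandas.DataFrame puts the outer keys in the columns and the inner keys in the index, which is the layout shown in the pandas=True output of the example.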
-
infer_types() → None Infer the most likely column types based on the column type regex patterns given to the initializer.
-
most_likely_col_types() → Dict Computes the most likely column type for each column.
- Returns
A dict with column names as keys and the most likely column types as values.
- Return type
dict
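One plausible reading of "most likely" is the type with the highest candidate ratio per column. The sketch below applies that reading to the ratios from the example at the top of this page and reproduces the output shown there. The helper most_likely is hypothetical; how the library breaks ties, and whether 'NA' or 'other' can be returned, is not stated here and is not modelled.

```python
from typing import Dict

# Candidate ratios taken from the example output at the top of this page.
ratios: Dict[str, Dict[str, float]] = {
    'mostly_alpha': {'int': 0.2, 'alpha': 0.6, 'NA': 0.0, 'other': 0.2},
    'mostly_int': {'int': 0.6, 'alpha': 0.2, 'NA': 0.2, 'other': 0.0},
}


def most_likely(ratios: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick, per column, the type name with the highest ratio."""
    return {
        col: max(type_ratios, key=type_ratios.get)
        for col, type_ratios in ratios.items()
    }


print(most_likely(ratios))
# {'mostly_alpha': 'alpha', 'mostly_int': 'int'}
```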
-
patterns.py
Regex patterns for common column types
-
csvtype.patterns.default_patterns Default col_type_patterns used by Inferencer.__init__ when no col_type_patterns dict is given. Judging by the default shown in the __init__ signature, it includes patterns for alpha, bool, date, float and int.
- Type
dict of str, list of str pairs
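Some of the default patterns overlap, which is worth keeping in mind when reading the candidate ratios. The sketch below copies four of the defaults from the __init__ signature (the date patterns are left out for brevity) and lists every type whose pattern matches a value, using Python's re module as a stand-in for boost::regex. The helper matches is hypothetical, and how the inferencer itself resolves a value that matches several types is not documented here.

```python
import re
from typing import Dict, List

# Copied from the default col_type_patterns in the __init__ signature
# (date patterns omitted for brevity).
patterns: Dict[str, List[str]] = {
    'alpha': [r'^[a-zA-Z]+$'],
    'bool': [r'^(true|false|yes|no|ja|nee|y|n|j|0|1|t|f|waar|onwaar)$'],
    'float': [r'^([-+]?\d*\.\d+)$'],
    'int': [r'^[-+]?\d+$'],
}


def matches(value: str) -> List[str]:
    """Return every type name that has at least one pattern matching value."""
    return [
        name
        for name, regexes in patterns.items()
        if any(re.search(regex, value) for regex in regexes)
    ]


print(matches('1'))     # ['bool', 'int']
print(matches('y'))     # ['alpha', 'bool']
print(matches('.5'))    # ['float']
print(matches('True'))  # ['alpha']: the bool pattern is case-sensitive
print(matches('3.'))    # []: the float pattern requires digits after the dot
```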
-
csvtype.patterns.default_na_values Default na_values used by Inferencer.__init__ when no na_values set is given. Includes the most common NA values, including those used in MS Excel.
- Type
set of str