Pandas for time series data — tricks and tips
There are some Pandas DataFrame manipulations that I keep looking up how to do. I am recording them here to save myself time. They may help you too.
Time series data
Convert a column to datetime with a given format
df['day_time'] = pd.to_datetime(df['day_time'], format='%Y-%m-%d %H:%M:%S')

0   2012-10-12 00:00:00
1   2012-10-12 00:30:00
2   2012-10-12 01:00:00
3   2012-10-12 01:30:00
Re-index a dataframe to interpolate missing values (eg every 30 mins below). You need to have a datetime index on the df before running this.
full_idx = pd.date_range(start=df['day_time'].min(), end=df['day_time'].max(), freq='30T')

df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)
Find missing dates in a DataFrame
# Note date_range is inclusive of the end date
ref_date_range = pd.date_range('2012-2-5 00:00:00', '2014-2-8 23:30:00', freq='30Min')
ref_df = pd.DataFrame(np.random.randint(1, 20, (ref_date_range.shape[0], 1)))
ref_df.index = ref_date_range

# check for missing datetimeindex values based on reference index (with all values)
missing_dates = ref_df.index[~ref_df.index.isin(df.index)]
missing_dates
>>DatetimeIndex(['2013-09-09 23:00:00', '2013-09-09 23:30:00',
                 '2013-09-10 00:00:00', '2013-09-10 00:30:00'],
                dtype='datetime64[ns]', freq='30T')
Split a dataframe based on a date in a datetime column
split_date = pd.datetime(2014,2,2)
df_train = df.loc[df['day_time'] < split_date]
df_test = df.loc[df['day_time'] >= split_date]
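Note that pd.datetime has been deprecated and removed in recent pandas versions; the same split works with pd.Timestamp (a minimal sketch, same column name as above):

split_date = pd.Timestamp(2014, 2, 2)
df_train = df.loc[df['day_time'] < split_date]
df_test = df.loc[df['day_time'] >= split_date]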
Find the nearest date in a dataframe (here we assume the index is a datetime field)
dt = pd.to_datetime("2016-04-23 11:00:00")
df.index.get_loc(dt, method="nearest")

#get index date
idx = df.index[df.index.get_loc(dt, method='nearest')]

#row to series
s = df.iloc[df.index.get_loc(dt, method='nearest')]
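In newer pandas the method argument of get_loc has been removed; an equivalent nearest lookup uses get_indexer (a sketch, assuming the same datetime index):

pos = df.index.get_indexer([dt], method='nearest')[0]
idx = df.index[pos]   #nearest index date
s = df.iloc[pos]      #row as a series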
Calculate a delta between datetimes in rows (assuming the index is a datetime)
df['t_val'] = df.index
df['delta'] = (df['t_val']-df['t_val'].shift()).fillna(0)
Calculate a running delta between a date column and a given date (eg here we use the first date in the date column as the date we want to difference to).
dt = pd.to_datetime(str(train_df['date'].iloc[0]))
dt
>>Timestamp('2016-01-10 00:00:00')

train_df['elapsed'] = pd.Series(delta.seconds for delta in (train_df['date'] - dt))

#convert seconds to hours
train_df['elapsed'] = train_df['elapsed'].apply(lambda x: x/3600)
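Note that delta.seconds only returns the seconds component of each timedelta; if the dates span more than a day, total_seconds() is usually what you want. A vectorised sketch (same train_df and dt as above):

train_df['elapsed'] = (train_df['date'] - dt).dt.total_seconds() / 3600   #elapsed hours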
Housekeeping
Reset index
            data
day_time
2014-02-02  0.45
2014-02-02  0.41

df.reset_index(inplace=True)

   day_time    data
0  2014-02-02  0.45
1  2014-02-02  0.41

#to drop it
df.reset_index(drop=True, inplace=True)
Set index
df = df.set_index("day_time")
Reset index, don't keep the original index
df = df.reset_index(drop=True)
Drop column(s)
df.drop(columns=['col_to_drop','other_col_to_drop'], inplace=True)
Drop rows that contain a duplicate value in a specific column(s)
df=df.drop_duplicates(subset=['id'])
Rename column(s)
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
Sort dataframe by column_1 then column_2, in ascending order
df.sort_values(by=['column_1', 'column_2'])

#descending
df.sort_values(by='column_1', ascending=0)
Sort dataframe using a list (sorts by column 'id' using the given list order of id's)
ids_sort_by=['34g','56gf','2w','34nb']

df['id_cat'] = pd.Categorical(
    df['id'],
    categories=ids_sort_by,
    ordered=True
)

df=df.sort_values('id_cat')
Selection
Select rows from a DataFrame based on values in a column in pandas
Super useful snippets after https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas
df.loc[df['column_name'] == some_value]
df.loc[df['column_name'].isin(some_values)]
df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
Select columns from dataframe
df1 = df[['a','b']]
Get unique values in a column
acorns = df.Acorn.unique()
#same as
acorns = df['Acorn'].unique()
Get row where value in column is a minimum
lowest_row = df.iloc[df['column_1'].argmin()]
Select by row number
my_series = df.iloc[0]
my_df = df.iloc[[0]]
Select by column number
df.iloc[:,0]
Get column names for maximum value in each row
classes=df.idxmax(axis=1)
Select 70% of Dataframe rows
df_n = df.sample(frac=0.7)
Randomly select n rows from a Dataframe
df_n = df.sample(n=20)
Select rows where a column doesn't (remove the tilde for does) contain a substring
df[~df['name'].str.contains("mouse")]
Select rows containing a substring from a list of substrings
#eg current df['id'] consists of ['23_a', '23_b','45_1','45_2']
core_ids=['23','45']
df=df[df['id'].str.contains('|'.join(core_ids))]
Select duplicated rows based on selected columns
dup_df=df_loss[df_loss.duplicated(['id','model'])]
Select duplicated rows based on all columns (returns all except the first occurrence)
dup_df=df_loss[df_loss.duplicated()]
Select using a query, then set a value for a specific column. In the example below we search the dataframe on the 'island' column and 'vegetation' column, and for the matches we set the 'biodiversity' column to 'low'
df.loc[(df['island'] == 'zanzibar') & (df['vegetation'] == 'cleared'), ['biodiversity']]='low'
Create 10 fold Train/Test Splits
Similar to selecting a % of dataframe rows, we can repeat this randomly to create 10 fold train/test set splits using a 90/10 train test split ratio.
#tt_splits.py
import pandas as pd

DATA_PATH=some_path_to_data
train_df = pd.read_csv(DATA_PATH+'train.csv')

def create_folds():
    train_dfs=[]
    val_dfs = []
    for n in range(10):
        train_n_df = train_df.sample(frac=0.9).copy()
        val_n_df=train_df[~train_df.isin(train_n_df)].dropna()
        train_dfs.append(train_n_df)
        val_dfs.append(val_n_df)
    return train_dfs,val_dfs

def write_folds(train_dfs,val_dfs):
    i=0
    for t,v in zip(train_dfs, val_dfs):
        t.to_csv(DATA_PATH+f'fold_{i}_train.csv', index=False)
        v.to_csv(DATA_PATH+f'fold_{i}_test.csv', index=False)
        i+=1

if __name__ == '__main__':
    train_dfs,val_dfs = create_folds()
    write_folds(train_dfs, val_dfs)
Group by
Group by columns, get the most common occurrence of a string in another column (eg class predictions on different runs of a model).
#id    model_name  pred
#34g4  resnet50    car
#34g4  resnet50    bus

mode_df=temp_df.groupby(['id', 'model_name'])['pred'].agg(pd.Series.mode).to_frame()
Group by column, apply an operation, then convert the result to a dataframe
df = df.groupby(['LCLid']).mean().reset_index()
Replacement
Replace rows in a dataframe with rows from another dataframe with the same index.
#for example first I created a new dataframe based on a selection
df_b = df_a.loc[df_a['machine_id'].isnull()]

#replace column with value from another column
for i in df_b.index:
    df_b.at[i, 'machine_id'] = df_b.at[i, 'box_id']

#now replace rows in original dataframe
df_a.loc[df_b.index] = df_b
Replace value in column(s) by row index
df.loc[0:2,'col'] = 42
Replace a substring in a column. See the pandas docs for regex use
pred_df['id'] = pred_df['id'].str.replace('_raw', '')
Iterate over rows
Using iterrows
for index, row in df.iterrows():
print (row["type"], row["value"])
Using itertuples (faster, see https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas)
for row in df.itertuples():
print (getattr(row, "type"), getattr(row, "value"))
If you need to modify the rows you are iterating, use apply:
def my_fn(c):
    return c + 1

df['plus_one'] = df.apply(lambda row: my_fn(row['value']), axis=1)
Or alternatively, see the good explanation here and the example below:
for i in df.index:
if <something>:
df.at[i, 'ifor'] = x
else:
df.at[i, 'ifor'] = y
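For example, a minimal runnable version of that pattern (the 'value' threshold and the 'ifor' column are made up for illustration):

for i in df.index:
    if df.at[i, 'value'] > 10:   #some condition on the row
        df.at[i, 'ifor'] = 'high'
    else:
        df.at[i, 'ifor'] = 'low'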
NaN's
Replace NaN in df or column with zeros (or value)
df.fillna(0)
df['some_column'].fillna(0, inplace=True)
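Note that on recent pandas, calling fillna with inplace=True on a selected column can hit chained-assignment warnings and may not update the original df; assigning the result back avoids that:

df['some_column'] = df['some_column'].fillna(0)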
Count NaN's in a column
df['energy(kWh/hh)'].isna().sum()
Find which columns have NaNs, get a list of those columns, and select the columns with one or more NaNs. After https://stackoverflow.com/questions/36226083/how-to-find-which-columns-contain-any-nan-value-in-pandas-dataframe-python
#which cols have nan
df.isna().any()

#list of cols with nan
df.columns[df.isna().any()].tolist()

#select cols with nan
df.loc[:, df.isna().any()]
Get rows where column is NaN
df[df['Col2'].isnull()]
Data Analysis
Show last n rows of dataframe
df.tail(n=2)
Show the transpose of the dataframe head. We pass in len(list(df)) as the number to head to show all the columns
df.head().T.head(len(list(df)))

>>              0                    1                    2                    3                    4
index           2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00
LCLid           MAC000006            MAC005178            MAC000066            MAC004510            MAC004882
energy(kWh/hh)  0.042                0.561                0.037                0.254                0.426
dayYear         2012                 2012                 2012                 2012                 2012
dayMonth        2                    2                    2                    2                    2
dayWeek         5                    5                    5                    5                    5
dayDay          5                    5                    5                    5                    5
dayDayofweek    6                    6                    6                    6                    6
dayDayofyear    36                   36                   36                   36                   36
Calculate the mean for each cell across multiple dataframes
df_concat = pd.concat((df_1, df_2, df_3, df_4))
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()
Calculate sum across all columns for each row
df_means['Sum'] = df_means.sum(axis=1)
String operations
Replace a specific character in column
df['bankHoliday'] = df['bankHoliday'].str.replace('?','')
Concatenate two columns
df['concat'] = df["id"].astype(str) + '-' + df["name"]
Merge
Merge DataFrame on multiple columns
df = pd.merge(X, y, on=['city','yr','weekofyear'])
Concat / Append vertically
(see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
df = df1.append(df2, ignore_index=True)

#or
frames = [df1, df2, df3]
result = pd.concat(frames)
Split
Split a dataframe into N roughly equal sized dataframes
idxs=df.index.values
chunked = np.array_split(idxs, NUM_CORES)

for chunk in chunked:
    part_df = df.loc[df.index.isin(chunk)]
    #run some process on the part
    p = Process(target=proc_chunk, args=[part_df])
    jobs.append(p)
    p.start()
Split a column string by the final occurrence of a substring, create new columns
df[['model_name','run']] = df.model.str.rsplit(pat="-", n=1, expand=True)
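For example, with hypothetical model strings like 'resnet50-run-1', rsplit with n=1 only splits on the last '-':

df = pd.DataFrame({'model': ['resnet50-run-1', 'resnet50-run-2']})
df[['model_name','run']] = df.model.str.rsplit(pat="-", n=1, expand=True)
#model_name is 'resnet50-run', run is '1' and '2'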
Type conversion
Change column type in dataframe
df_test[['value']] = df_test[['value']].astype(int)
Add data
Add an empty column
df["nan_column"] = np.nan df["zero_column"] = 0
Data types
Convert columns 'a' and 'b' to numeric, coerce non numeric to 'NaN'
df[['a', 'b']] = df[['a', 'b']].apply(pd.to_numeric, errors='coerce')
Creating DataFrames
From a list of dicts
df = pd.DataFrame([sig_dict, id_dict, phase_dict, target_dict])
df=df.T
df.columns=['signal','id','phase','target']
From a list
missing=['dog','cat','frog']
df=pd.DataFrame({"missing":missing})
From multiple lists
df=pd.DataFrame(list(zip(mylist1, mylist2, mylist3)),
columns=['title1','title2', 'title3'])
Numpy
As an alternative method to concatenating dataframes, you can use numpy (less memory intensive than pandas, useful for big merges)
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
a, b
>(array([[1, 2],
        [3, 4]]), array([[5, 6],
        [7, 8]]))

c=np.concatenate((a, b), axis=1)
c
>array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

df = pd.DataFrame(c)
df.head()
>   0  1  2  3
0   1  2  5  6
1   3  4  7  8

for i in range(10):
    df = pq.read_table(path+f'df_{i}.parquet').to_pandas()
    vals = df.values
    if i > 0:
        #axis=1 to concat horizontally
        np_vals = np.concatenate((np_vals, vals), axis=1)
    else:
        np_vals=vals

np.savetxt(path+f'df_np.csv', np_vals, delimiter=",")
Import/Export
Group by a column, then export each group into a separate dataframe
f = lambda x: x.to_csv("{}.csv".format(x.name.lower()), index=False)
df.groupby('LCLid').apply(f)

#for example our original dataframe may be:
      day_time             LCLid      energy(kWh/hh)
289   2012-02-05 00:00:00  MAC004954  0.45
289   2012-02-05 00:30:00  MAC004954  0.46
6100  2012-02-05 05:30:00  MAC000041  0.23
Import / Export in Feather format
Here we save a DataFrame in feather format (really fast to read back in). Note I have an issue saving feather files >~2GB using pandas==0.23.4
df.to_feather('df_data.feather')

import feather as ftr
df = ftr.read_dataframe('df_data.feather')
Import / Export in Parquet format
import pyarrow.parquet as pq

df.to_parquet("data.parquet")
df = pq.read_table("data.parquet").to_pandas()
Save without index
df.to_csv('file.csv', index=False)
Read in, specifying new column names
df = pd.read_csv('signals.csv', names=['phase', 'amplitude'])
datetime64 Date and Time Codes
from here: https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
date units:
Code  Meaning
Y     year
M     month
W     week
D     day

time units:
Code  Meaning
h     hour
m     minute
s     second
ms    millisecond
us    microsecond
ns    nanosecond
ps    picosecond
fs    femtosecond
as    attosecond
Source: https://adriangcoder.medium.com/pandas-tricks-and-tips-a7b87c3748ea