Pandas Remove Duplicate Timestamps on Read Csv and Reindex

Pandas for time series data — tricks and tips

There are some Pandas DataFrame manipulations that I keep looking up how to do. I am recording them here to save myself time. They may help you too.

Convert column to datetime with a given format

df['day_time'] = pd.to_datetime(df['day_time'], format='%Y-%m-%d %H:%M:%S')

0   2012-10-12 00:00:00
1   2012-10-12 00:30:00
2   2012-10-12 01:00:00
3   2012-10-12 01:30:00

Re-index a dataframe to interpolate missing values (eg every 30 mins below). You need to have a datetime index on the df before running this.

full_idx = pd.date_range(start=df['day_time'].min(), end=df['day_time'].max(), freq='30T')

df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)
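For reference, a minimal self-contained sketch of the same pattern (the toy values here are made up for illustration); note the datetime index is set before the groupby/reindex:

import pandas as pd

# toy data with a missing 01:00 reading
df = pd.DataFrame({
    'day_time': pd.to_datetime(['2012-10-12 00:00:00', '2012-10-12 00:30:00', '2012-10-12 01:30:00']),
    'LCLid': ['MAC000006', 'MAC000006', 'MAC000006'],
    'energy(kWh/hh)': [0.042, 0.043, 0.044],
})

full_idx = pd.date_range(start=df['day_time'].min(), end=df['day_time'].max(), freq='30T')

# set the datetime index first, then reindex each household onto the full range
df = df.set_index('day_time')
df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)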

Find missing dates in a DataFrame

# Note date_range is inclusive of the end date
ref_date_range = pd.date_range('2012-02-05 00:00:00', '2014-02-08 23:30:00', freq='30Min')
ref_df = pd.DataFrame(np.random.randint(1, 20, (ref_date_range.shape[0], 1)))
ref_df.index = ref_date_range

# check for missing datetimeindex values based on reference index (with all values)
missing_dates = ref_df.index[~ref_df.index.isin(df.index)]
missing_dates
>>DatetimeIndex(['2013-09-09 23:00:00', '2013-09-09 23:30:00',
                 '2013-09-10 00:00:00', '2013-09-10 00:30:00'],
                dtype='datetime64[ns]', freq='30T')

Split a dataframe based on a date in a datetime column

split_date = pd.Timestamp(2014, 2, 2)

df_train = df.loc[df['day_time'] < split_date]
df_test = df.loc[df['day_time'] >= split_date]

Find the nearest date in a dataframe (here we assume the index is a datetime field)

dt = pd.to_datetime("2016-04-23 11:00:00")

df.index.get_loc(dt, method="nearest")

#get index date
idx = df.index[df.index.get_loc(dt, method='nearest')]
#row to series
s = df.iloc[df.index.get_loc(dt, method='nearest')]
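Note that in pandas 2.0+ the method argument of Index.get_loc has been removed; get_indexer covers the same nearest lookup:

#newer pandas: positional index of the nearest timestamp
pos = df.index.get_indexer([dt], method='nearest')[0]
idx = df.index[pos]
s = df.iloc[pos]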

Calculate a delta between datetimes in rows (assuming the index is datetime)

df['t_val'] = df.index
df['delta'] = (df['t_val'] - df['t_val'].shift()).fillna(0)

Calculate a running delta between a date column and a given date (eg here we use the first date in the date column as the date we want to difference to).

dt = pd.to_datetime(str(train_df['date'].iloc[0]))
dt
>>Timestamp('2016-01-10 00:00:00')

train_df['elapsed'] = pd.Series(delta.seconds for delta in (train_df['date'] - dt))
#convert seconds to hours
train_df['elapsed'] = train_df['elapsed'].apply(lambda x: x/3600)
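One caveat with the above: timedelta.seconds only gives the seconds part of each delta (it wraps at one day), so for spans longer than 24 hours dt.total_seconds() is safer:

#elapsed hours, robust for deltas longer than a day
train_df['elapsed'] = (train_df['date'] - dt).dt.total_seconds() / 3600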

Reset index

#original df
            data
day_time
2014-02-02  0.45
2014-02-02  0.41

df.reset_index(inplace=True)

    day_time  data
0 2014-02-02  0.45
1 2014-02-02  0.41

#to drop it instead
df.reset_index(drop=True, inplace=True)

Set index

          df = df.set_index("day_time")        

Reset index, don't keep the original index

          df = df.reset_index(drop=True)        

Drop column(s)

df.drop(columns=['col_to_drop','other_col_to_drop'], inplace=True)

Drop rows that contain a duplicate value in a specific column(s)

          df=df.drop_duplicates(subset=['id'])        
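The same call covers the case in this post's title, dropping duplicate timestamps after reading a csv and before reindexing. A sketch, assuming a csv with a 'day_time' column (the filename is just a placeholder):

df = pd.read_csv('data.csv')  #placeholder filename
df['day_time'] = pd.to_datetime(df['day_time'], format='%Y-%m-%d %H:%M:%S')
#keep the first row for each timestamp so a later set_index/reindex doesn't hit duplicate labels
df = df.drop_duplicates(subset=['day_time'], keep='first')
df = df.set_index('day_time')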

Rename column(s)

          df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)        

Sort dataframe by column_1 then column_2, in ascending order

df.sort_values(by=['column_1', 'column_2'])

#descending
df.sort_values(by='column_1', ascending=False)

Sort dataframe using a list (sorts by column 'id' using the given list order of id's)

ids_sort_by = ['34g', '56gf', '2w', '34nb']

df['id_cat'] = pd.Categorical(
    df['id'],
    categories=ids_sort_by,
    ordered=True
)
df = df.sort_values('id_cat')

Selection

Select rows from a DataFrame based on values in a column in pandas

Super useful snippets, after https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas

df.loc[df['column_name'] == some_value]
df.loc[df['column_name'].isin(some_values)]
df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]

Select columns from dataframe

          df1 = df[['a','b']]        

Get unique values in a column

acorns = df.Acorn.unique()
#same as
acorns = df['Acorn'].unique()

Get row where value in column is a minimum

          lowest_row = df.iloc[df['column_1'].argmin()]        

Select by row number

          my_series = df.iloc[0]
my_df = df.iloc[[0]]

Select by column number

          df.iloc[:,0]        

Get column names for maximum value in each row

          classes=df.idxmax(axis=1)        

Select 70% of Dataframe rows

          df_n = df.sample(frac=0.7)        

Randomly select n rows from a Dataframe

df_n = df.sample(n=20)

Select rows where a column doesn't (remove the tilde for does) contain a substring

          df[~df['name'].str.contains("mouse")]        

Select rows containing a substring from a list of substrings

#eg current df['id'] consists of ['23_a', '23_b', '45_1', '45_2']
core_ids = ['23', '45']
df = df[df['id'].str.contains('|'.join(core_ids))]

Select duplicated rows based on selected columns

          dup_df=df_loss[df_loss.duplicated(['id','model'])]        

Select duplicated rows based on all columns (returns all except the first occurrence)

          dup_df=df_loss[df_loss.duplicated()]        

Select using a query and then set a value for a specific column. In the example below we search the dataframe on the 'island' column and 'vegetation' column, and for the matches we set the 'biodiversity' column to 'low'

df.loc[(df['island'] == 'zanzibar') & (df['vegetation'] == 'cleared'), ['biodiversity']] = 'low'

Similar to selecting a % of dataframe rows, we can repeat this randomly to create 10-fold train/test set splits using a 90/10 train test split ratio.

#tt_splits.py
import pandas as pd

DATA_PATH = some_path_to_data
train_df = pd.read_csv(DATA_PATH + 'train.csv')

def create_folds():
    train_dfs = []
    val_dfs = []
    for n in range(10):
        train_n_df = train_df.sample(frac=0.9).copy()
        val_n_df = train_df[~train_df.isin(train_n_df)].dropna()
        train_dfs.append(train_n_df)
        val_dfs.append(val_n_df)
    return train_dfs, val_dfs

def write_folds(train_dfs, val_dfs):
    i = 0
    for t, v in zip(train_dfs, val_dfs):
        t.to_csv(DATA_PATH + f'fold_{i}_train.csv', index=False)
        v.to_csv(DATA_PATH + f'fold_{i}_test.csv', index=False)
        i += 1

if __name__ == '__main__':
    train_dfs, val_dfs = create_folds()
    write_folds(train_dfs, val_dfs)

Group by columns, get the most common occurrence of a string in another column (eg class predictions on different runs of a model).

#id   model_name  pred
#34g4 resnet50    car
#34g4 resnet50    bus
mode_df = temp_df.groupby(['id', 'model_name'])['pred'].agg(pd.Series.mode).to_frame()
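Note that pd.Series.mode can return more than one value when predictions are tied, which leaves a list-like cell; if a single prediction per group is needed, one option (an assumption on my part, not from the original post) is to take the first mode:

mode_df = temp_df.groupby(['id', 'model_name'])['pred'].agg(lambda x: x.mode().iloc[0]).to_frame()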

Group by column, apply an operation, then convert the result to a dataframe

df = df.groupby(['LCLid']).mean().reset_index()

Replace rows in a dataframe with rows from another dataframe with the same index.

#for example first I created a new dataframe based on a selection
df_b = df_a.loc[df_a['machine_id'].isnull()]

#replace column with value from another column
for i in df_b.index:
    df_b.at[i, 'machine_id'] = df_b.at[i, 'box_id']

#now replace rows in original dataframe
df_a.loc[df_b.index] = df_b

Replace value in column(s) by row index

          df.loc[0:2,'col'] = 42        

Replace substring in column. See the pandas docs for regex use

          pred_df['id'] = pred_df['id'].str.replace('_raw', '')        

Using iterrows

for index, row in df.iterrows():
    print(row["type"], row["value"])

Using itertuples (faster, see https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas)

for row in df.itertuples():
    print(getattr(row, "type"), getattr(row, "value"))

If you need to modify the rows you are iterating, use apply:

def my_fn(c):
    return c + 1

df['plus_one'] = df.apply(lambda row: my_fn(row['value']), axis=1)

Or alternatively, see the example below:

for i in df.index:
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

Replace NaN in df or column with zeros (or value)

df.fillna(0)
df['some_column'].fillna(0, inplace=True)

Count NaN's in a column

          df['energy(kWh/hh)'].isna().sum()        

Find which columns have NaNs, get a list of those columns, and select the columns with one or more NaNs. After https://stackoverflow.com/questions/36226083/how-to-find-which-columns-contain-any-nan-value-in-pandas-dataframe-python

#which cols have nan
df.isna().any()
#list of cols with nan
df.columns[df.isna().any()].tolist()
#select cols with nan
df.loc[:, df.isna().any()]

Get rows where column is NaN

          df[df['Col2'].isnull()]        

Show last n rows of dataframe

          df.tail(n=2)        

Show the transpose of the dataframe head. We pass in len(list(df)) as the number to head to show all the columns

df.head().T.head(len(list(df)))

>>                              0                    1                    2                    3                    4
index          2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00
LCLid                    MAC000006            MAC005178            MAC000066            MAC004510            MAC004882
energy(kWh/hh)               0.042                0.561                0.037                0.254                0.426
dayYear                       2012                 2012                 2012                 2012                 2012
dayMonth                         2                    2                    2                    2                    2
dayWeek                          5                    5                    5                    5                    5
dayDay                           5                    5                    5                    5                    5
dayDayofweek                     6                    6                    6                    6                    6
dayDayofyear                    36                   36                   36                   36                   36

Calculate the mean for each cell across multiple dataframes

df_concat = pd.concat((df_1, df_2, df_3, df_4))
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()

Calculate sum across all columns for each row

          df_means['Sum'] = df_means.sum(axis=1)        

Replace a specific character in column

                      
df['bankHoliday'] = df['bankHoliday'].str.replace('?', '', regex=False)  #regex=False so '?' is treated literally

Concatenate two columns

          df['concat'] = df["id"].astype(str) + '-' + df["name"]        

Merge DataFrame on multiple columns

          df = pd.merge(X, y, on=['city','yr','weekofyear'])        

Append / concatenate DataFrames (see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

df = df1.append(df2, ignore_index=True)

#or
frames = [df1, df2, df3]
result = pd.concat(frames)
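Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the pd.concat form is the one to rely on going forward:

df = pd.concat([df1, df2], ignore_index=True)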

Split dataframe into N roughly equal sized dataframes

from multiprocessing import Process

idxs = df.index.values
chunked = np.array_split(idxs, NUM_CORES)

for chunk in chunked:
    part_df = df.loc[df.index.isin(chunk)]
    #run some process on the part
    p = Process(target=proc_chunk, args=[part_df])
    jobs.append(p)
    p.start()
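The snippet above assumes jobs, NUM_CORES and proc_chunk are already defined; a self-contained sketch of the same pattern, with a placeholder worker:

import numpy as np
import pandas as pd
from multiprocessing import Process

NUM_CORES = 4

def proc_chunk(part_df):
    #placeholder worker: run some process on the part
    print(part_df.shape)

if __name__ == '__main__':
    df = pd.DataFrame({'value': range(100)})
    chunked = np.array_split(df.index.values, NUM_CORES)
    jobs = []
    for chunk in chunked:
        part_df = df.loc[df.index.isin(chunk)]
        p = Process(target=proc_chunk, args=[part_df])
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()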

Split a column string by the last occurrence of a substring, create new columns

df[['model_name','run']] = df.model.str.rsplit(pat="-", n=1, expand=True)

Change column type in dataframe

          df_test[['value']] = df_test[['value']].astype(int)        

Add an empty column

          df["nan_column"] = np.nan          df["zero_column"] = 0        

Convert columns 'a' and 'b' to numeric, coercing non-numeric values to NaN

          df[['a', 'b']] = df[['a', 'b']].apply(pd.to_numeric, errors='coerce')        

From a list of dicts

df = pd.DataFrame([sig_dict, id_dict, phase_dict, target_dict])
df = df.T
df.columns = ['signal', 'id', 'phase', 'target']

From a list

missing = ['dog', 'cat', 'frog']
df = pd.DataFrame({"missing": missing})

From multiple lists

          df=pd.DataFrame(list(zip(mylist1, mylist2, mylist3)),
columns=['title1','title2', 'title3'])

As an alternative method to concatenating dataframes, you can use numpy (less memory intensive than pandas, useful for big merges)

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
a, b
>(array([[1, 2],
        [3, 4]]), array([[5, 6],
        [7, 8]]))

c = np.concatenate((a, b), axis=1)
c
>array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

df = pd.DataFrame(c)
df.head()
>   0  1  2  3
0  1  2  5  6
1  3  4  7  8

for i in range(10):
    df = pq.read_table(path+f'df_{i}.parquet').to_pandas()
    vals = df.values
    if i > 0:
        #axis=1 to concat horizontally
        np_vals = np.concatenate((np_vals, vals), axis=1)
    else:
        np_vals = vals

np.savetxt(path+f'df_np.csv', np_vals, delimiter=",")

Group by a column, then export each group to a separate csv file

f = lambda x: x.to_csv("{}.csv".format(x.name.lower()), index=False)
df.groupby('LCLid').apply(f)

#for example our original dataframe may be:
      day_time             LCLid      energy(kWh/hh)
289   2012-02-05 00:00:00  MAC004954  0.45
289   2012-02-05 00:30:00  MAC004954  0.46
6100  2012-02-05 05:30:00  MAC000041  0.23

Import / Export in Feather format

Here we save a DataFrame in feather format (really fast to read back in). Note I have an issue saving feather files >~2GB using pandas==0.23.4

df.to_feather('df_data.feather')

import feather as ftr
df = ftr.read_dataframe('df_data.feather')

Import / Export in Parquet format

import pyarrow.parquet as pq

df.to_parquet("data.parquet")
df = pq.read_table("data.parquet").to_pandas()

Save without index

          df.to_csv('file.csv', index=False)        

Read in, specifying new column names

df = pd.read_csv('signals.csv', names=['phase', 'amplitude'])

Numpy datetime unit codes, from here: https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html

Date units:

Code  Meaning
Y     year
M     month
W     week
D     day

Time units:

Code  Meaning
h     hour
m     minute
s     second
ms    millisecond
us    microsecond
ns    nanosecond
ps    picosecond
fs    femtosecond
as    attosecond
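These codes are the units used by numpy's datetime64 and timedelta64 types, for example:

import numpy as np

np.datetime64('2012-02-05', 'D')                              #day precision
np.datetime64('2012-02-05T00:30', 'm')                        #minute precision
np.datetime64('2012-02-05T00:30') + np.timedelta64(30, 'm')   #add a 30 minute step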


Source: https://adriangcoder.medium.com/pandas-tricks-and-tips-a7b87c3748ea
