Pandas for time series data — tricks and tips
There are some Pandas DataFrame manipulations that I keep looking up how to do. I am recording them here to save myself time. They may help you too.
Time series data
Convert a column to datetime with a given format
df['day_time'] = pd.to_datetime(df['day_time'], format='%Y-%m-%d %H:%M:%S')

0   2012-10-12 00:00:00
1   2012-10-12 00:30:00
2   2012-10-12 01:00:00
3   2012-10-12 01:30:00
Re-index a dataframe to interpolate missing values (eg every 30 mins below). You need to have a datetime index on the df before running this.
full_idx = pd.date_range(start=df['day_time'].min(), end=df['day_time'].max(), freq='30T')

df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)
Find missing dates in a DataFrame
# Note date_range is inclusive of the end date
ref_date_range = pd.date_range('2012-2-5 00:00:00', '2014-2-8 23:30:00', freq='30Min')
ref_df = pd.DataFrame(np.random.randint(1, 20, (ref_date_range.shape[0], 1)))
ref_df.index = ref_date_range

# check for missing datetimeindex values based on reference index (with all values)
missing_dates = ref_df.index[~ref_df.index.isin(df.index)]
missing_dates
>>DatetimeIndex(['2013-09-09 23:00:00', '2013-09-09 23:30:00',
                 '2013-09-10 00:00:00', '2013-09-10 00:30:00'],
                dtype='datetime64[ns]', freq='30T')
Split a dataframe based on a date in a datetime column
split_date = pd.datetime(2014,2,2)
df_train = df.loc[df['day_time'] < split_date]
df_test = df.loc[df['day_time'] >= split_date]
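Note that pd.datetime has been deprecated and removed in recent pandas versions; the same split works with pd.Timestamp (a minimal sketch, same column name as above):

split_date = pd.Timestamp(2014, 2, 2)
df_train = df.loc[df['day_time'] < split_date]
df_test = df.loc[df['day_time'] >= split_date]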
Find the nearest date in a dataframe (here we assume the index is a datetime field)
dt = pd.to_datetime("2016-04-23 11:00:00")
df.index.get_loc(dt, method="nearest")

#get index date
idx = df.index[df.index.get_loc(dt, method='nearest')]

#row to series
s = df.iloc[df.index.get_loc(dt, method='nearest')]
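In newer pandas the method argument of get_loc has been removed; an equivalent nearest lookup uses get_indexer (a sketch, assuming the same datetime index):

pos = df.index.get_indexer([dt], method='nearest')[0]
idx = df.index[pos]   #nearest index date
s = df.iloc[pos]      #row as a series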
Calculate a delta between datetimes in rows (assuming the index is a datetime)
df['t_val'] = df.index
df['delta'] = (df['t_val']-df['t_val'].shift()).fillna(0)
Calculate a running delta between a date column and a given date (eg here we use the first date in the date column as the date we want to difference to).
dt = pd.to_datetime(str(train_df['date'].iloc[0]))
dt
>>Timestamp('2016-01-10 00:00:00')

train_df['elapsed'] = pd.Series(delta.seconds for delta in (train_df['date'] - dt))

#convert seconds to hours
train_df['elapsed'] = train_df['elapsed'].apply(lambda x: x/3600)
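Note that delta.seconds only returns the seconds component of each timedelta; if the dates span more than a day, total_seconds() is usually what you want. A vectorised sketch (same train_df and dt as above):

train_df['elapsed'] = (train_df['date'] - dt).dt.total_seconds() / 3600   #elapsed hours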
Housekeeping
Reset index
            data
day_time
2014-02-02  0.45
2014-02-02  0.41

df.reset_index(inplace=True)

   day_time    data
0  2014-02-02  0.45
1  2014-02-02  0.41

#to drop it
df.reset_index(drop=True, inplace=True)
Set index
df = df.set_index("day_time")
Reset index, don't keep the original index
df = df.reset_index(drop=True)
Drop column(s)
df.drop(columns=['col_to_drop','other_col_to_drop'], inplace=True)
Drop rows that contain a duplicate value in a specific column(s)
df=df.drop_duplicates(subset=['id'])
Rename column(s)
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
Sort dataframe by column_1 then column_2, in ascending order
df.sort_values(by=['column_1', 'column_2'])

#descending
df.sort_values(by='column_1', ascending=0)
Sort dataframe using a list (sorts by column 'id' using the given list order of id's)
ids_sort_by=['34g','56gf','2w','34nb']

df['id_cat'] = pd.Categorical(
    df['id'],
    categories=ids_sort_by,
    ordered=True
)

df=df.sort_values('id_cat')
Selection
Select rows from a DataFrame based on values in a column in pandas
Super useful snippets after https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas
df.loc[df['column_name'] == some_value]
df.loc[df['column_name'].isin(some_values)]
df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
Select columns from dataframe
df1 = df[['a','b']]
Get unique values in a column
acorns = df.Acorn.unique()
#same as
acorns = df['Acorn'].unique()
Get row where value in column is a minimum
lowest_row = df.iloc[df['column_1'].argmin()]
Select by row number
my_series = df.iloc[0]
my_df = df.iloc[[0]]
Select by column number
df.iloc[:,0]
Get column names for maximum value in each row
classes=df.idxmax(axis=1)
Select 70% of Dataframe rows
df_n = df.sample(frac=0.7)
Randomly select n rows from a Dataframe
df_n = df.sample(n=20)
Select rows where a column doesn't (remove the tilde for does) contain a substring
df[~df['name'].str.contains("mouse")]
Select rows containing a substring from a list of substrings
#eg current df['id'] consists of ['23_a', '23_b','45_1','45_2']
core_ids=['23','45']
df=df[df['id'].str.contains('|'.join(core_ids))]
Select duplicated rows based on selected columns
dup_df=df_loss[df_loss.duplicated(['id','model'])]
Select duplicated rows based on all columns (returns all except the first occurrence)
dup_df=df_loss[df_loss.duplicated()]
Select using a query, then set a value for a specific column. In the example below we search the dataframe on the 'island' column and 'vegetation' column, and for the matches we set the 'biodiversity' column to 'low'
df.loc[(df['island'] == 'zanzibar') & (df['vegetation'] == 'cleared'), ['biodiversity']]='low'
Create 10 fold Train/Test Splits
Similar to selecting a % of dataframe rows, we can repeat this randomly to create 10 fold train/test set splits using a 90/10 train test split ratio.
#tt_splits.py
import pandas as pd

DATA_PATH=some_path_to_data
train_df = pd.read_csv(DATA_PATH+'train.csv')

def create_folds():
    train_dfs=[]
    val_dfs = []
    for n in range(10):
        train_n_df = train_df.sample(frac=0.9).copy()
        val_n_df=train_df[~train_df.isin(train_n_df)].dropna()
        train_dfs.append(train_n_df)
        val_dfs.append(val_n_df)
    return train_dfs,val_dfs

def write_folds(train_dfs,val_dfs):
    i=0
    for t,v in zip(train_dfs, val_dfs):
        t.to_csv(DATA_PATH+f'fold_{i}_train.csv', index=False)
        v.to_csv(DATA_PATH+f'fold_{i}_test.csv', index=False)
        i+=1

if __name__ == '__main__':
    train_dfs,val_dfs = create_folds()
    write_folds(train_dfs, val_dfs)
Group by
Group by columns, get the most common occurrence of a string in another column (eg class predictions on different runs of a model).
#id    model_name  pred
#34g4  resnet50    car
#34g4  resnet50    bus

mode_df=temp_df.groupby(['id', 'model_name'])['pred'].agg(pd.Series.mode).to_frame()
Group by column, apply an operation, then convert the result to a dataframe
df = df.groupby(['LCLid']).mean().reset_index()
Replacement
Replace rows in a dataframe with rows from another dataframe with the same index.
#for example first I created a new dataframe based on a selection
df_b = df_a.loc[df_a['machine_id'].isnull()]

#replace column with value from another column
for i in df_b.index:
    df_b.at[i, 'machine_id'] = df_b.at[i, 'box_id']

#now replace rows in original dataframe
df_a.loc[df_b.index] = df_b
Replace value in column(s) by row index
df.loc[0:2,'col'] = 42
Replace a substring in a column. See the pandas docs for regex use
pred_df['id'] = pred_df['id'].str.replace('_raw', '')
Iterate over rows
Using iterrows
for index, row in df.iterrows():
print (row["type"], row["value"])
Using itertuples (faster, see https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas)
for row in df.itertuples():
print (getattr(row, "type"), getattr(row, "value"))
If you need to modify the rows you are iterating, use apply:
def my_fn(c):
    return c + 1

df['plus_one'] = df.apply(lambda row: my_fn(row['value']), axis=1)
Or alternatively, see the good explanation here and the example below:
for i in df.index:
if <something>:
df.at[i, 'ifor'] = x
else:
df.at[i, 'ifor'] = y
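For example, a minimal runnable version of that pattern (the 'value' threshold and the 'ifor' column are made up for illustration):

for i in df.index:
    if df.at[i, 'value'] > 10:   #some condition on the row
        df.at[i, 'ifor'] = 'high'
    else:
        df.at[i, 'ifor'] = 'low'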
NaN's
Replace NaN in df or column with zeros (or value)
df.fillna(0)
df['some_column'].fillna(0, inplace=True)
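Note that on recent pandas, calling fillna with inplace=True on a selected column can hit chained-assignment warnings and may not update the original df; assigning the result back avoids that:

df['some_column'] = df['some_column'].fillna(0)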
Count NaN's in a column
df['energy(kWh/hh)'].isna().sum()
Find which columns have NaNs, get a list of those columns, and select the columns with one or more NaNs. After https://stackoverflow.com/questions/36226083/how-to-find-which-columns-contain-any-nan-value-in-pandas-dataframe-python
#which cols have nan
df.isna().any()

#list of cols with nan
df.columns[df.isna().any()].tolist()

#select cols with nan
df.loc[:, df.isna().any()]
Get rows where column is NaN
df[df['Col2'].isnull()]
Data Analysis
Show last n rows of dataframe
df.tail(n=2)
Show the transpose of the dataframe head. We pass in len(list(df)) as the number to head to show all the columns
df.head().T.head(len(list(df)))

>>              0                    1                    2                    3                    4
index           2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00
LCLid           MAC000006            MAC005178            MAC000066            MAC004510            MAC004882
energy(kWh/hh)  0.042                0.561                0.037                0.254                0.426
dayYear         2012                 2012                 2012                 2012                 2012
dayMonth        2                    2                    2                    2                    2
dayWeek         5                    5                    5                    5                    5
dayDay          5                    5                    5                    5                    5
dayDayofweek    6                    6                    6                    6                    6
dayDayofyear    36                   36                   36                   36                   36
Calculate the mean for each cell across multiple dataframes
df_concat = pd.concat((df_1, df_2, df_3, df_4))
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()
Calculate sum across all columns for each row
df_means['Sum'] = df_means.sum(axis=1)
String operations
Replace a specific character in column
df['bankHoliday'] = df['bankHoliday'].str.replace('?','')
Concatenate two columns
df['concat'] = df["id"].astype(str) + '-' + df["name"]
Merge
Merge DataFrame on multiple columns
df = pd.merge(X, y, on=['city','yr','weekofyear'])
Concat / Append vertically
(see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
df = df1.append(df2, ignore_index=True)

#or
frames = [df1, df2, df3]
result = pd.concat(frames)
Split
Split a dataframe into N roughly equal sized dataframes
idxs=df.index.values
chunked = np.array_split(idxs, NUM_CORES)

for chunk in chunked:
    part_df = df.loc[df.index.isin(chunk)]
    #run some process on the part
    p = Process(target=proc_chunk, args=[part_df])
    jobs.append(p)
    p.start()
Split a column string by the final occurrence of a substring, create new columns
df[['model_name','run']] = df.model.str.rsplit(pat="-", n=1, expand=True)
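For example, with hypothetical model strings like 'resnet50-run-1', rsplit with n=1 only splits on the last '-':

df = pd.DataFrame({'model': ['resnet50-run-1', 'resnet50-run-2']})
df[['model_name','run']] = df.model.str.rsplit(pat="-", n=1, expand=True)
#model_name is 'resnet50-run', run is '1' and '2'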
Type conversion
Change column type in dataframe
df_test[['value']] = df_test[['value']].astype(int)
Add data
Add an empty column
df["nan_column"] = np.nan df["zero_column"] = 0
Data types
Convert columns 'a' and 'b' to numeric, coerce non numeric to 'NaN'
df[['a', 'b']] = df[['a', 'b']].apply(pd.to_numeric, errors='coerce')
Creating DataFrames
From a list of dicts
df = pd.DataFrame([sig_dict, id_dict, phase_dict, target_dict])
df=df.T
df.columns=['signal','id','phase','target']
From a list
missing=['dog','cat','frog']
df=pd.DataFrame({"missing":missing})
From multiple lists
df=pd.DataFrame(list(zip(mylist1, mylist2, mylist3)),
columns=['title1','title2', 'title3'])
Numpy
As an alternative method to concatenating dataframes, you can use numpy (less memory intensive than pandas, useful for big merges)
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
a, b
>(array([[1, 2],
        [3, 4]]), array([[5, 6],
        [7, 8]]))

c=np.concatenate((a, b), axis=1)
c
>array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

df = pd.DataFrame(c)
df.head()
>   0  1  2  3
0   1  2  5  6
1   3  4  7  8

for i in range(10):
    df = pq.read_table(path+f'df_{i}.parquet').to_pandas()
    vals = df.values
    if i > 0:
        #axis=1 to concat horizontally
        np_vals = np.concatenate((np_vals, vals), axis=1)
    else:
        np_vals=vals

np.savetxt(path+f'df_np.csv', np_vals, delimiter=",")
Import/Export
Group by a column, then export each group into a separate dataframe
f = lambda x: x.to_csv("{}.csv".format(x.name.lower()), index=False)
df.groupby('LCLid').apply(f)

#for example our original dataframe may be:
      day_time             LCLid      energy(kWh/hh)
289   2012-02-05 00:00:00  MAC004954  0.45
289   2012-02-05 00:30:00  MAC004954  0.46
6100  2012-02-05 05:30:00  MAC000041  0.23
Import / Export in Feather format
Here we save a DataFrame in feather format (really fast to read back in). Note I have an issue saving feather files >~2GB using pandas==0.23.4
df.to_feather('df_data.feather')

import feather as ftr
df = ftr.read_dataframe('df_data.feather')
Import / Export in Parquet format
import pyarrow.parquet as pq

df.to_parquet("data.parquet")
df = pq.read_table("data.parquet").to_pandas()
Save without index
df.to_csv('file.csv', index=False)
Read in, specifying new column names
df = pd.read_csv('signals.csv', names=['phase', 'amplitude'])
datetime64 Date and Time Codes
from here: https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
date units:
Code  Meaning
Y     year
M     month
W     week
D     day

time units:
Code  Meaning
h     hour
m     minute
s     second
ms    millisecond
us    microsecond
ns    nanosecond
ps    picosecond
fs    femtosecond
as    attosecond
Source: https://adriangcoder.medium.com/pandas-tricks-and-tips-a7b87c3748ea