10 Minutes from pandas to Koalas on Apache Spark

Server/Python | 2020. 5. 7. 10:49 | Posted by youGom

pandas is a great tool to analyze small datasets on a single machine. When the need for bigger datasets arises, users often choose PySpark. However, converting code from pandas to PySpark is not easy, as the PySpark APIs are considerably different from the pandas APIs. Koalas makes the learning curve significantly easier by providing pandas-like APIs on top of PySpark. With Koalas, users can take advantage of the benefits of PySpark with minimal effort, and thus get to value much faster.

A number of blog posts such as Koalas: Easy Transition from pandas to Apache Spark, How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas, and 10 minutes to Koalas in the official Koalas docs have demonstrated the ease of conversion between pandas and Koalas. However, despite having the same APIs, there are subtleties when working in a distributed environment that may not be obvious to pandas users. In addition, only about 70% of the pandas APIs are implemented in Koalas. While the open-source community is actively implementing the remaining pandas APIs, users need to fall back to PySpark to work around the gaps in the meantime. Finally, Koalas also offers its own APIs such as to_spark(), DataFrame.map_in_pandas(), ks.sql(), etc. that can significantly improve user productivity.

Therefore, Koalas is not meant to completely replace the need to learn PySpark. Instead, Koalas makes learning PySpark much easier by offering pandas-like functions. To be proficient in Koalas, users need to understand the basics of Spark and some PySpark APIs. In fact, we find that users who use Koalas and PySpark interchangeably tend to extract the most value from Koalas.

In particular, two types of users benefit the most from Koalas:

  • pandas users who want to scale out using PySpark and potentially migrate their codebase to PySpark. Koalas is scalable and makes learning PySpark much easier.
  • Spark users who want to leverage Koalas to become more productive. Koalas offers pandas-like functions so that users don’t have to build these functions themselves in PySpark.

This blog post will not only demonstrate how easy it is to convert code written in pandas to Koalas, but also discuss best practices for using Koalas: when to use Koalas as a drop-in replacement for pandas, how to work around in PySpark when a pandas API is not available in Koalas, when to apply Koalas-specific APIs to improve productivity, and so on. The example notebook in this blog can be found here.

Distributed and Partitioned Koalas DataFrame

Even though you can apply the same APIs in Koalas as in pandas, under the hood a Koalas DataFrame is very different from a pandas DataFrame. A Koalas DataFrame is distributed, which means the data is partitioned and computed across different workers. On the other hand, all the data in a pandas DataFrame fits in a single machine. As you will see, this difference leads to different behaviors.
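
As a rough, minimal sketch (assuming a running Spark environment; ks.range and to_spark() are covered later in this post), the distributed layout can be observed through the partitions of the underlying Spark DataFrame:

import databricks.koalas as ks

kdf = ks.range(1000)               # a Koalas DataFrame; its rows live on the cluster, not in one process
sdf = kdf.to_spark()               # the Spark DataFrame backing it
print(sdf.rdd.getNumPartitions())  # typically > 1: the data is split across partitions and workers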

Migration from pandas to Koalas

This section describes how Koalas supports easy migration from pandas, with various code examples.

Object Creation

The packages below are customarily imported in order to use Koalas. Technically, packages like NumPy and pandas are not required, but they allow users to utilize Koalas more flexibly.

import numpy as np
import pandas as pd
import databricks.koalas as ks

A Koalas Series can be created by passing a list of values, the same way as a pandas Series. A Koalas Series can also be created by passing a pandas Series.

# Create a pandas Series
pser = pd.Series([1, 3, 5, np.nan, 6, 8]) 
# Create a Koalas Series
kser = ks.Series([1, 3, 5, np.nan, 6, 8])
# Create a Koalas Series by passing a pandas Series
kser = ks.Series(pser)
kser = ks.from_pandas(pser)

Best Practice: As shown below, unlike pandas, Koalas does not guarantee the order of indices. This is because almost all operations in Koalas run in a distributed manner. You can use Series.sort_index() if you want ordered indices.

>>> pser
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
>>> kser
3    NaN
2    5.0
1    3.0
5    8.0
0    1.0
4    6.0
Name: 0, dtype: float64
# Apply sort_index() to a Koalas series
>>> kser.sort_index() 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: 0, dtype: float64

A Koalas DataFrame can also be created by passing a NumPy array, the same way as a pandas DataFrame. Unlike a PySpark DataFrame, a Koalas DataFrame has an index, so the index of a pandas DataFrame is preserved when a Koalas DataFrame is created from it.

# Create a pandas DataFrame
pdf = pd.DataFrame({'A': np.random.rand(5),
                    'B': np.random.rand(5)})
# Create a Koalas DataFrame
kdf = ks.DataFrame({'A': np.random.rand(5),
                    'B': np.random.rand(5)})
# Create a Koalas DataFrame by passing a pandas DataFrame
kdf = ks.DataFrame(pdf)
kdf = ks.from_pandas(pdf)

Likewise, the order of indices can be sorted by DataFrame.sort_index().

>>> pdf
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712
>>> kdf.sort_index()
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712

Viewing Data

As with a pandas DataFrame, the top rows of a Koalas DataFrame can be displayed using DataFrame.head(). Confusion often arises when converting from pandas to PySpark because head() behaves differently in the two, but Koalas behaves the same way as pandas by using PySpark's limit() under the hood.

>>> kdf.head(2)
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
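
For reference, here is a minimal sketch of roughly what head(2) maps to in plain PySpark (to_spark() is covered later in this post; the actual plan may differ):

>>> kdf.to_spark().limit(2).toPandas()  # Spark's limit() followed by collecting to the driver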

A quick statistical summary of a Koalas DataFrame can be displayed using DataFrame.describe().

>>> kdf.describe()
              A         B
count  5.000000  5.000000
mean   0.344998  0.660481
std    0.360486  0.195485
min    0.015869  0.388611
25%    0.037077  0.584455
50%    0.224340  0.632132
75%    0.637126  0.820495
max    0.810577  0.876712

Sorting a Koalas DataFrame can be done using DataFrame.sort_values().

>>> kdf.sort_values(by='B')
          A         B
3  0.810577  0.388611
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
4  0.037077  0.876712

Transposing a Koalas DataFrame can be done using DataFrame.transpose().

>>> kdf.transpose()
          0         1         2         3         4
A  0.015869  0.224340  0.637126  0.810577  0.037077
B  0.584455  0.632132  0.820495  0.388611  0.876712

Best Practice: DataFrame.transpose() will fail when the number of rows exceeds the value of compute.max_rows, which is set to 1000 by default. This is to prevent users from unknowingly executing expensive operations. In Koalas, you can easily change the default value of compute.max_rows. See the official docs for DataFrame.transpose() for more details.

>>> from databricks.koalas.config import set_option, get_option
>>> ks.get_option('compute.max_rows')
1000
>>> ks.set_option('compute.max_rows', 2000)
>>> ks.get_option('compute.max_rows')
2000

Selecting or Accessing Data

As with a pandas DataFrame, selecting a single column from a Koalas DataFrame returns a Series.

>>> kdf['A']  # or kdf.A
0    0.015869
1    0.224340
2    0.637126
3    0.810577
4    0.037077
Name: A, dtype: float64

Selecting multiple columns from a Koalas DataFrame returns a Koalas DataFrame.

>>> kdf[['A', 'B']]
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712

Slicing is available for selecting rows from a Koalas DataFrame.

>>> kdf.loc[1:2]
          A         B
1  0.224340  0.632132
2  0.637126  0.820495

Slicing rows and columns is also available.

>>> kdf.iloc[:3, 1:2]
          B
0  0.584455
1  0.632132
2  0.820495

Best Practice: By default, Koalas disallows adding columns coming from different DataFrames or Series to a Koalas DataFrame as adding columns requires join operations which are generally expensive. This operation can be enabled by setting compute.ops_on_diff_frames to True. See Available options in the docs for more detail.

>>> kser = ks.Series([100, 200, 300, 400, 500], index=[0, 1, 2, 3, 4])
>>> kdf['C'] = kser


...
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
# Those are needed for managing options
>>> from databricks.koalas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> kdf['C'] = kser
# Reset to default to avoid potential expensive operation in the future
>>> reset_option("compute.ops_on_diff_frames")
>>> kdf
          A         B    C
0  0.015869  0.584455  100
1  0.224340  0.632132  200
3  0.810577  0.388611  400
2  0.637126  0.820495  300
4  0.037077  0.876712  500

Applying a Python Function to Koalas DataFrame

DataFrame.apply() is a very powerful function favored by many pandas users. Koalas DataFrames also support this function.

>>> kdf.apply(np.cumsum)
          A         B     C
0  0.015869  0.584455   100
1  0.240210  1.216587   300
3  1.050786  1.605198   700
2  1.687913  2.425693  1000
4  1.724990  3.302404  1500

DataFrame.apply() also works for axis = 1 or ‘columns’ (0 or ‘index’ is the default).

>>> kdf.apply(np.cumsum, axis=1)
          A         B           C
0  0.015869  0.600324  100.600324
1  0.224340  0.856472  200.856472
3  0.810577  1.199187  401.199187
2  0.637126  1.457621  301.457621
4  0.037077  0.913788  500.913788

Also, a Python native function can be applied to a Koalas DataFrame.

>>> kdf.apply(lambda x: x ** 2)
          A         B       C
0  0.000252  0.341588   10000
1  0.050329  0.399591   40000
3  0.657035  0.151018  160000
2  0.405930  0.673212   90000
4  0.001375  0.768623  250000

Best Practice: While it works fine as it is, it is recommended to specify a return type hint, which Koalas passes to Spark internally, when applying user-defined functions to a Koalas DataFrame. If the return type hint is not specified, Koalas runs the function once on a small sample to infer the Spark return type, which can be fairly expensive.

>>> def square(x) -> ks.Series[np.float64]:
...     return x ** 2
>>> kdf.apply(square)
          A         B         C
0  0.405930  0.673212   90000.0
1  0.001375  0.768623  250000.0
2  0.000252  0.341588   10000.0
3  0.657035  0.151018  160000.0
4  0.050329  0.399591   40000.0

Note that DataFrame.apply() in Koalas does not support global aggregations by design. However, if the size of the data is below compute.shortcut_limit, it might work, because Koalas uses pandas as a shortcut execution.

# Working properly since size of data <= compute.shortcut_limit (1000)
>>> ks.DataFrame({'A': range(1000)}).apply(lambda col: col.max())
A    999
Name: 0, dtype: int64
# Not working properly since size of data > compute.shortcut_limit (1000)
>>> ks.DataFrame({'A': range(1001)}).apply(lambda col: col.max())
A     165
A     580
A     331
A     497
A     829
A     414
A     746
A     663
A     912
A    1000
A     248
A      82
Name: 0, dtype: int64

Best Practice: In Koalas, compute.shortcut_limit (default = 1000) controls how many rows are computed in pandas as a shortcut when operating on a small dataset. Koalas uses the pandas API directly in some cases when the size of the input data is below this threshold. Therefore, setting this limit too high could slow down execution or even lead to out-of-memory errors. The following code example sets a higher compute.shortcut_limit, which then allows the previous code to work properly. See the Available options for more details.

>>> ks.set_option('compute.shortcut_limit', 1001)
>>> ks.DataFrame({'A': range(1001)}).apply(lambda col: col.max())
A    1000
Name: 0, dtype: int64

Grouping Data

Grouping data by columns is one of the common APIs in pandas. DataFrame.groupby() is available in Koalas as well.

>>> kdf.groupby('A').sum()
                 B    C
A                      
0.224340  0.632132  200
0.637126  0.820495  300
0.015869  0.584455  100
0.810577  0.388611  400
0.037077  0.876712  500

Grouping by multiple columns is also supported, as shown below.

>>> kdf.groupby(['A', 'B']).sum()
                     C
A        B            
0.224340 0.632132  200
0.015869 0.584455  100
0.037077 0.876712  500
0.810577 0.388611  400
0.637126 0.820495  300

Plotting and Visualizing Data

In pandas, DataFrame.plot is a good solution for visualizing data. It can be used in the same way in Koalas.

Note that Koalas leverages approximation for faster rendering. Therefore, the results could be slightly different when the number of rows is larger than plotting.max_rows.

See the example below that plots a Koalas DataFrame as a bar chart with DataFrame.plot.bar().

>>> speed = [0.1, 17.5, 40, 48, 52, 69, 88]
>>> lifespan = [2, 8, 70, 1.5, 25, 12, 28]
>>> index = ['snail', 'pig', 'elephant',
...          'rabbit', 'giraffe', 'coyote', 'horse']
>>> kdf = ks.DataFrame({'speed': speed,
...                     'lifespan': lifespan}, index=index)
>>> kdf.plot.bar()

Example visualization plotting a Koalas DataFrame as a bar chart with DataFrame.plot.bar().

Also, the horizontal bar plot is supported with DataFrame.plot.barh().

>>> kdf.plot.barh()

Example visualization plotting a Koalas DataFrame as a horizontal bar chart

Make a pie plot using DataFrame.plot.pie().

>>> kdf = ks.DataFrame({'mass': [0.330, 4.87, 5.97],
...                     'radius': [2439.7, 6051.8, 6378.1]},
...                    index=['Mercury', 'Venus', 'Earth'])
>>> kdf.plot.pie(y='mass')

Example pie chart visualization using a Koalas DataFrame

Best Practice: For bar and pie plots, only the top n rows are displayed for efficient rendering; the number of rows can be set with the plotting.max_rows option.
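
As a sketch, the option can be adjusted with the same helpers used earlier for compute.max_rows (the value 2000 is only illustrative):

>>> from databricks.koalas.config import set_option, reset_option
>>> set_option('plotting.max_rows', 2000)  # consider the top 2000 rows when rendering bar/pie plots
>>> reset_option('plotting.max_rows')      # restore the default afterwards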

Make a stacked area plot using DataFrame.plot.area().

>>> kdf = ks.DataFrame({
...     'sales': [3, 2, 3, 9, 10, 6, 3],
...     'signups': [5, 5, 6, 12, 14, 13, 9],
...     'visits': [20, 42, 28, 62, 81, 50, 90],
... }, index=pd.date_range(start='2019/08/15', end='2020/03/09',
...                        freq='M'))
>>> kdf.plot.area()

Example stacked area plot visualization using a Koalas DataFrame

Make line charts using DataFrame.plot.line().

>>> kdf = ks.DataFrame({'pig': [20, 18, 489, 675, 1776],
...                     'horse': [4, 25, 281, 600, 1900]},
...                    index=[1990, 1997, 2003, 2009, 2014])
>>> kdf.plot.line()

Example line chart visualization using a Koalas DataFrame

Best Practice: For area and line plots, the proportion of data that will be plotted can be set by plotting.sample_ratio. By default it follows plotting.max_rows (1000 rows). See Available options for details.
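
A minimal sketch, assuming plotting.sample_ratio takes a fraction of the rows to plot (the 0.1 here is illustrative); option_context, shown later in this post, limits the setting to the with block:

>>> from databricks.koalas import option_context
>>> with option_context('plotting.sample_ratio', 0.1):  # plot roughly 10% of the rows
...     kdf.plot.line()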

Make a histogram using DataFrame.plot.hist()

>>> kdf = pd.DataFrame(
...     np.random.randint(1, 7, 6000),
...     columns=['one'])
>>> kdf['two'] = kdf['one'] + np.random.randint(1, 7, 6000)
>>> kdf = ks.from_pandas(kdf)
>>> kdf.plot.hist(bins=12, alpha=0.5)

Example histogram visualization using a Koalas DataFrame

Make a scatter plot using DataFrame.plot.scatter()

>>> kdf = ks.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
...                     [6.4, 3.2, 1], [5.9, 3.0, 2]],
...                    columns=['length', 'width', 'species'])
>>> kdf.plot.scatter(x='length', y='width', c='species', colormap='viridis')

Example scatter plot visualization using a Koalas DataFrame

Missing Functionalities and Workarounds in Koalas

When working with Koalas, there are a few things to look out for. First, not all pandas APIs are currently available in Koalas; at the moment, about 70% of the pandas APIs are implemented. In addition, there are subtle behavioral differences between Koalas and pandas, even when the same APIs are applied. Due to these differences, it would not make sense to implement certain pandas APIs in Koalas at all. This section discusses common workarounds.

Using pandas APIs via Conversion

When dealing with missing pandas APIs in Koalas, a common workaround is to convert Koalas DataFrames to pandas or PySpark DataFrames, and then apply either pandas or PySpark APIs. Converting between Koalas DataFrames and pandas/PySpark DataFrames is pretty straightforward: DataFrame.to_pandas() and koalas.from_pandas() for conversion to/from pandas; DataFrame.to_spark() and DataFrame.to_koalas() for conversion to/from PySpark. However, if the Koalas DataFrame is too large to fit in a single machine, converting to pandas can cause an out-of-memory error.

The following code snippets show a simple usage of to_pandas().

>>> kidx = kdf.index
>>> kidx.to_list()

...
PandasNotImplementedError: The method `pd.Index.to_list()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

Best Practice: Index.to_list() raises PandasNotImplementedError. Koalas does not support this because it requires collecting all data into the client (driver node) side. A simple workaround is to convert to pandas using to_pandas().

>>> kidx.to_pandas().to_list()
[0, 1, 2, 3, 4]

Native Support for pandas Objects

Koalas also provides native support for pandas objects; it can directly leverage pandas objects as shown below.

>>> kdf = ks.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20130102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'F': 'foo'})
>>> kdf
     A          B    C  D    F
0  1.0 2013-01-02  1.0  3  foo
1  1.0 2013-01-02  1.0  3  foo
2  1.0 2013-01-02  1.0  3  foo
3  1.0 2013-01-02  1.0  3  foo

ks.Timestamp() is not implemented yet, and ks.Series() cannot be used in the creation of a Koalas DataFrame. In these cases, the pandas native objects pd.Timestamp() and pd.Series() can be used instead.

Distributing a pandas Function in Koalas

In addition, Koalas offers Koalas-specific APIs such as DataFrame.map_in_pandas(), which natively supports distributing a given pandas function in Koalas.

>>> i = pd.date_range('2018-04-09', periods=2000, freq='1D1min')
>>> ts = ks.DataFrame({'A': ['timestamp']}, index=i)
>>> ts.between_time('0:15', '0:16')


...
PandasNotImplementedError: The method `pd.DataFrame.between_time()` is not implemented yet.

DataFrame.between_time() is not yet implemented in Koalas. As shown below, a simple workaround is to convert to a pandas DataFrame using to_pandas(), and then apply the function.

>>> ts.to_pandas().between_time('0:15', '0:16')
                             A
2018-04-24 00:15:00  timestamp
2018-04-25 00:16:00  timestamp
2022-04-04 00:15:00  timestamp
2022-04-05 00:16:00  timestamp

However, DataFrame.map_in_pandas() is a better workaround because it does not require moving the data to a single client node, which could otherwise cause out-of-memory errors.

>>> ts.map_in_pandas(func=lambda pdf: pdf.between_time('0:15', '0:16'))
                             A
2022-04-04 00:15:00  timestamp
2022-04-05 00:16:00  timestamp
2018-04-24 00:15:00  timestamp
2018-04-25 00:16:00  timestamp

Best Practice: In this way, DataFrame.between_time(), which is a pandas function, can be performed on a distributed Koalas DataFrame because DataFrame.map_in_pandas() executes the given function across multiple nodes. See DataFrame.map_in_pandas().

Using SQL in Koalas

Koalas supports standard SQL syntax with ks.sql(), which allows executing a Spark SQL query and returning the result as a Koalas DataFrame.

>>> kdf = ks.DataFrame({'year': [1990, 1997, 2003, 2009, 2014],
...                     'pig': [20, 18, 489, 675, 1776],
...                     'horse': [4, 25, 281, 600, 1900]})
>>> ks.sql("SELECT * FROM {kdf} WHERE pig > 100")
   year   pig  horse
0  2003   489    281
1  2009   675    600
2  2014  1776   1900

Also, a Koalas DataFrame and a pandas DataFrame can be mixed in a join operation.

>>> pdf = pd.DataFrame({'year': [1990, 1997, 2003, 2009, 2014],
...                     'sheep': [22, 50, 121, 445, 791],
...                     'chicken': [250, 326, 589, 1241, 2118]})
>>> ks.sql('''
...     SELECT ks.pig, pd.chicken
...     FROM {kdf} ks INNER JOIN {pdf} pd
...     ON ks.year = pd.year
...     ORDER BY ks.pig, pd.chicken''')
    pig  chicken
0    18      326
1    20      250
2   489      589
3   675     1241
4  1776     2118

Working with PySpark

You can also apply several PySpark APIs to Koalas DataFrames. A PySpark background can make you more productive when working in Koalas. If you know PySpark, you can use PySpark APIs as workarounds when the pandas-equivalent APIs are not available in Koalas, and if you feel comfortable with PySpark, you can use its many rich features such as the Spark UI, history server, etc.

Conversion from and to PySpark DataFrame

A Koalas DataFrame can be easily converted to a PySpark DataFrame using DataFrame.to_spark(), similar to DataFrame.to_pandas(). On the other hand, a PySpark DataFrame can be easily converted to a Koalas DataFrame using DataFrame.to_koalas(), which extends the Spark DataFrame class.

>>> kdf = ks.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
>>> sdf = kdf.to_spark()
>>> type(sdf)
pyspark.sql.dataframe.DataFrame
>>> sdf.show()
+---+---+
|  A|  B|
+---+---+
|  1| 10|
|  2| 20|
|  3| 30|
|  4| 40|
|  5| 50|
+---+---+

Note that converting from PySpark to Koalas can cause an out-of-memory error when the default index type is sequence. The default index type can be set with compute.default_index_type (default = sequence). If a sequence-like index is needed for a large dataset, distributed-sequence should be used instead.

>>> from databricks.koalas import option_context
>>> with option_context(
...         "compute.default_index_type", "distributed-sequence"):
...     kdf = sdf.to_koalas()
>>> type(kdf)
databricks.koalas.frame.DataFrame
>>> kdf
   A   B
3  4  40
1  2  20
2  3  30
4  5  50
0  1  10

Best Practice: Converting from a PySpark DataFrame to a Koalas DataFrame can have some overhead because it requires creating a new default index internally – PySpark DataFrames do not have indices. You can avoid this overhead by specifying a column that can be used as the index column. See the Default Index type for more detail.

>>> sdf.to_koalas(index_col='A')
    B
A    
1  10
2  20
3  30
4  40
5  50

Checking Spark’s Execution Plans

DataFrame.explain() is a useful PySpark API that is also available in Koalas. It shows the Spark execution plans before the actual execution, which helps you understand and predict the actual execution and avoid critical performance degradation.

from databricks.koalas import option_context

with option_context(
        "compute.ops_on_diff_frames", True,
        "compute.default_index_type", 'distributed'):
    df = ks.range(10) + ks.range(10)
    df.explain()

The command above simply adds two DataFrames with the same values. The result is shown below.

== Physical Plan ==
*(5) Project [...]
+- SortMergeJoin [...], FullOuter
   :- *(2) Sort [...], false, 0
   :  +- Exchange hashpartitioning(...), [id=#]
   :     +- *(1) Project [...]
   :        +- *(1) Range (0, 10, step=1, splits=12)
   +- *(4) Sort [...], false, 0
      +- ReusedExchange [...], Exchange hashpartitioning(...), [id=#]

As shown in the physical plan, the execution will be fairly expensive because it will perform a sort merge join to combine the DataFrames. To improve the execution performance, you can reuse the same DataFrame to avoid the merge. See Physical Plans in Spark SQL to learn more.

with option_context(
        "compute.ops_on_diff_frames", False,
        "compute.default_index_type", 'distributed'):
    df = ks.range(10)
    df = df + df
    df.explain()

Now the code uses the same DataFrame for the operations, so no different DataFrames are combined and no sort merge join is triggered (combining different DataFrames is what compute.ops_on_diff_frames controls, and it is disabled here).

== Physical Plan ==
*(1) Project [...]
+- *(1) Project [...]
   +- *(1) Range (0, 10, step=1, splits=12)

This operation is much cheaper than the previous one while producing the same output. Use DataFrame.explain() like this to help improve the efficiency of your code.

Caching DataFrame

DataFrame.cache() is a useful PySpark API that is available in Koalas as well. It caches the output of a Koalas operation so that it does not need to be computed again in subsequent executions. This can significantly improve execution speed when the output needs to be accessed repeatedly.

with option_context("compute.default_index_type", 'distributed'):
    df = ks.range(10)
    new_df = (df + df).cache()  # `(df + df)` is cached here as `new_df`
    new_df.explain()

As the physical plan shows below, new_df will be cached once it is executed.

== Physical Plan ==
*(1) InMemoryTableScan [...]
   +- InMemoryRelation [...], StorageLevel(...)
      +- *(1) Project [...]
         +- *(1) Project [...]
            +- *(1) Project [...]
               +- *(1) Range (0, 10, step=1, splits=12)

InMemoryTableScan and InMemoryRelation mean that new_df is cached – the same (df + df) operation does not need to be performed when it is executed the next time.

A cached DataFrame can be uncached by DataFrame.unpersist().

new_df.unpersist()

Best Practice: A cached DataFrame can be used with a context manager to scope the caching to the DataFrame: it is cached when the with block is entered and uncached again when the block exits.

with (df + df).cache() as df:
    df.explain()

Conclusion

The examples in this blog demonstrate how easily you can migrate your pandas codebase to Koalas when working with large datasets. Koalas is built on top of PySpark and provides the same API interface as pandas. While there are subtle differences between pandas and Koalas, Koalas provides additional Koalas-specific functions to make working in a distributed setting easier. Finally, this blog shows common workarounds and best practices for working in Koalas. For pandas users who need to scale out, Koalas fits their needs nicely.

Get Started with Koalas on Apache Spark

You can get started by trying the examples in this blog in this notebook, visit the Koalas documentation and peruse examples, and contribute at the Koalas GitHub. Also, join the koalas-dev mailing list for discussions and new release announcements.

References

Source:

https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html

group by 1 in SQL: which column does it refer to?

#__table__
col1 col2
1	9
1	9
1	9
2	8
... ...

#1
select 
	col1, 
	col2
From 
	__table__
group by 1

#2
select 
	col1, 
	col2
From 
	__table__
group by col2

 

When doing a GROUP BY on __table__,

it is confusing which column group by 1 refers to.

So I checked.

 

The conclusion: it refers to the item at the same position in the SELECT list of the query.

If the query is select a, b, c, then group by 1 refers to a.

If you know this precisely it does not matter, but some queries refer to columns by number instead of writing the column name, so it is useful to know.

 

 

In the example below, id is the first column and group by 1 points at the first item, so this case is especially confusing.

As the result below shows, the aggregation is based on the first item, id.

WITH ex ( id, val) as (
    VALUES
    (1, 9),
    (1, 9),
    (1, 9),
    (2, 8),
    (3, 8),
    (4, 8),
    (5, 8)
) 

select id, count(1) as cnt from ex
group by id
-- group by 1  <<-- writing id or 1 gives the same result

-------------

id	cnt
3	1
1	3 <<<-----
2	1
4	1
5	1

 

As another example, grouping on the second column, val, produces the output below.

This shows that group by 1 works on the order specified in the SELECT list, not the column order of the table.

WITH ex ( id, val) as (
    VALUES
    (1, 9),
    (1, 9),
    (1, 9),
    (2, 8),
    (3, 8),
    (4, 8),
    (5, 8)
) 

select val, count(1) as cnt from ex
group by val
-- group by 1 <<-- writing val or 1 gives the same result

---------------

val	cnt	
9	3
8	4

 

Hopefully this helps avoid mistakes caused by confusion over group by 1, 2, ....

 


[CouchDB] Using CouchDB well from Ruby

Server/BigDB | 2013. 10. 28. 15:21 | Posted by youGom






[CouchDB] Getting the basic concepts of Map/Reduce

Server/BigDB | 2013. 10. 25. 14:48 | Posted by youGom




A good slide deck for getting the basic concepts!






[Book] BackTrack 4 공포의 툴

Books/Reading Notes | 2013. 9. 3. 15:21 | Posted by youGom


BackTrack

Original title: BackTrack 4
Pages: 433 | ISBN: 9788960772168 | Format: B5, 188*257mm


I'm jotting these notes down while reading BackTrack 4 so they'll be handy later. Just a rough summary~ ^^;



(1) Information Gathering


* nmap

The famous port scanner~

Simple usage is shown below; appending -p 80 scans only port 80.

# nmap -sT 10.1-255,0.10 

Options:

1. -sT : TCP connect scan (full handshake)

2. -sU : UDP scan

3. -sN : Null scan (null packets)

4. -sS : stealth (SYN) scan

Check the help for anything further!


* unicornscan : an information gathering and correlation engine tool

Useful for stimulating TCP/IP devices and then measuring their responses.

Use it as below:

# unicornscan -m U -Iv 10.10.0.1/24:1-65535

Adjust the packet transmission rate; with the default it takes far too long.

# unicornscan -r 100000 -m U -Iv 10.10.0.1/24:1-65535

You can see that the scan runs much faster this way.


* zenmap

Provided as a GUI.

You can compare two scans, and the output is rendered as a topology that is easy to inspect visually; machine names are shown in the topology.

Select 'Regular scan' under Profile to use it.


* amap : service detection

Use it as below:

# amap -bq 10.0.2.100 80 3306

It identifies the services running on ports 80 and 3306.


* httprint : a web-server fingerprinting tool

Running the command below retrieves the web server's signature. It is not completely accurate, but it makes fairly good guesses.

#./httprint -h 10.10.0.1 -s signature.txt


* httsquash : a tool that scans HTTP services, collects banners, and extracts data.

# ./httsquash -r 10.10.0.2


* ike-scan : a tool for discovering, fingerprinting, and testing the security of IPSec VPN systems

IKE (Internet Key Exchange) is the key exchange and authentication mechanism used by IPSec.

 - Sends IKE packets to any number of destination hosts (configurable in various ways)

 - Can decode response packets and print them to the screen

 - Can crack aggressive-mode pre-shared keys with the psk-crack tool

Discover servers with a command like the following:

# ike-scan -M -v 192.168.109.99

The output of the command above shows the encryption (3DES), hash (SHA1), authentication (PSK), Diffie-Hellman group (2), and SA lifetime (28800 seconds).

After obtaining the SA payload information, keep changing the transform attributes until fingerprinting the VPN server succeeds.

See "http://www.nta-monitor.com/wiki/index.php/Ike-scan_User_Guide#Trying_Different_Transforms" for the transform attributes.

# ike-scan -M --trans=5,2,1,2 --showbackoff 192.168.10.99

Unfortunately, the command above did not manage to fingerprint it. Perhaps it would work by varying the transform attributes? ^^;




(2) Finding Vulnerabilities


* ping : sends ECHO REQUEST packets over the ICMP protocol.

# ping -c 2 -s 1000 10.0.2.2 ( 2 ECHO packets, packet size 1000 )


* arping : confirms the destination with an Address Resolution Protocol (ARP) request when the destination host is on the LAN

 - Works at OSI Layer 2 (the data link layer), so it can only be used on the local network

 - ARP cannot be routed beyond a router or gateway

 - The response reveals the MAC address of that IP

# arping -c 3 10.0.2.2


* arping2 : a tool that sends ARP or ICMP requests to a target host

 - Not in the BackTrack 4 menu, but it exists at /pentest/misc/arping/arping2

# ./arping2 -c 3 192.168.1.1

# ./arping2 -c 3 00:17:16:02:b6:b3


* fping

* genlist

* hping2

* hping3

* lanmap


* nping : a newer tool that can generate network packets for various protocols ( TCP, UDP, ICMP, ARP )

 - Can discover hosts much like ping, stress-test network stacks, and be used for ARP poisoning, denial of service, and so on

 - Included in the nmap package on BackTrack 4

 - Use it with the command below: it sends one TCP packet ( --tcp -c 1 ) with the SYN flag set ( --flags SYN ) to destination port 22 ( -p 22 ) at IP address 10.0.20.100.

# nping -c 1 --tcp -p 22 --flags syn 10.0.20.100


* onesixtyone : a Simple Network Management Protocol ( SNMP ) scanner that checks devices for SNMP strings

 - SNMP : Simple Network Management Protocol

 - Use it as below:

# onesixtyone 192.168.1.1

# onesixtyone -d 192.168.1.1  ( add the -d option if you want more detail )


* p0f : OS fingerprinting, i.e. figuring out the target machine's operating system. There are two approaches, active and passive.

 - Used for passive fingerprinting

 - Running the command below writes a log file; while it records, you need to generate network activity that involves TCP connections.

 - Run it with the command below:

# p0f -o p0f.log

* xprobe2 : OS fingerprinting ( active )

 - Identifies the operating system using fuzzy signature matching, probabilistic guessing, simultaneous matching, and a signature database

 - Included modules : icmp_ping, tcp_ping, udp_ping, ttl_calc, portscan, icmp_echo, icmp_tstamp, icmp_amask, icmp_port_unreach, tcp_hshake, tcp_rst

 - Use it as below:

# xprobe2 10.0.2.100

 - If the guess turns out to be wrong, it is because the signature database has not been updated to the latest version.


Bitsnoop, a paradise of files

Fun Traces/Bookmarks - Sites | 2012. 3. 26. 23:00 | Posted by youGom


Go here and try searching for files;

you can download them directly as torrents.

http://bitsnoop.com

Just a little experience story.. (?) ㅋ

It was obviously an instruction that would eventually send everyone down a painful road.
Even so, we just do it. ^^;

It is a story about the UI side and network resource management.

The UI needs server data to render the screen, so it calls Request on the data network module.
If it makes too many calls ( 100 of them ), the request queue fills up and things break.
So the instruction is for the UI to pool the Request calls and hand them to the network layer one at a time.
( That is, if there are 10 requests, send #2 only after #1 has finished... )

I don't understand why pooling of the network's sender ( Request ) should be managed in the UI.
It is the network layer that dies, but it is the UI that gets fixed.

Right now a single UI screen is being built to do this on behalf of the lower layer, but with this one change every UI page will have to follow the same pattern.

There are more than 30 UI screens. If the network layer simply supported pooling, that would be enough...
( Something that could be done in 100 lines will instead take 200 lines, up to as many as 10,000. )

I won't raise this suggestion, because the higher-ups don't like proposals. The only proposal they like is the one they wanted to make themselves.

So.. I just do as I'm told.

My answer to the instruction above was "Understood."
Me..? Us..! We are not programmers, we are office workers.

ps. It most closely resembles the Model2 architecture rather than MVC.
( There is no design document. I'm just saying that, architecturally speaking, that's roughly what it is.. ^^;; The comparison is only to aid understanding. ^^; )

Shut up and play politics, or shut up and do as you're told. Looks like I have to pick one ^^;;
